Abstract

We develop and analyze stochastic optimization algorithms for problems in which the expected
loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches
are able to exploit only one of these two structures, yielding an $O(d/T)$ convergence rate for
strongly convex objectives in $d$ dimensions, and an $O(\sqrt{(s \log d)/T})$ convergence rate when the
optimum is $s$-sparse. Our algorithm is based on successively solving a series of $\ell_1$-regularized
optimization problems using Nesterov's dual averaging algorithm. We establish that the error
of our solution after $T$ iterations is at most $O((s \log d)/T)$, with natural extensions to approximate
sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential,
hinge and least-squares losses. By recourse to statistical minimax results, we show that our
convergence rates are optimal up to multiplicative constant factors. The effectiveness of our 
approach is also confirmed in numerical simulations, in which we compare to several baselines 
on a least-squares regression problem. 

1 Introduction 

Stochastic optimization algorithms have many desirable features for large-scale machine learning, 
and accordingly have been the focus of renewed and intensive study in the last several years (e.g., 
see the papers (26J HJ [TU], [30] and references therein). The empirical efficiency of these methods 
is backed with strong theoretical guarantees, providing sharp bounds on their convergence rates. 
These convergence rates are known to depend on the structure of the underlying objective function, 
with faster rates being possible for objective functions that are smooth and/or (strongly) convex, 
or optima that have desirable features such as sparsity. More precisely, for an objective function 
that is strongly convex, stochastic gradient descent enjoys a convergence rate ranging from $O(1/T)$,
when feature vectors are extremely sparse, to $O(d/T)$ when feature vectors are dense [11], [19], [12].
Such results are of significant interest, because the strong convexity condition is satisfied for many 
common machine learning problems, including boosting, least squares regression, support vector 
machines and generalized linear models, among other examples. 

A complementary type of condition is that of sparsity, either exact or approximate, in the 
optimal solution. Sparse models have proven useful in many application areas (see the overview 
papers [TBI [5] and references therein for further background), and many optimization-based 
statistical procedures seek to exploit such sparsity via $\ell_1$-regularization. A significant feature of
optimization algorithms for sparse problems is their mild logarithmic scaling with the problem
dimension [20], [27], [10], [30]. More precisely, it is known [20], [27] that when the optimal solution
$\theta^*$ has at most $s$ non-zero entries, appropriate versions of the stochastic mirror descent algorithm
converge at a rate $O(s\sqrt{(\log d)/T})$. Srebro et al. [28] exploit the smoothness of many common
loss functions; in application to sparse linear regression, their analysis yields improved rates of the
form $O(\eta\sqrt{(s \log d)/T})$, where $\eta^2$ is the noise variance. While the $\sqrt{\log d}$ scaling of the error makes
these methods attractive in high dimensions, observe that the scaling with respect to the number of
iterations is relatively slow, namely $O(1/\sqrt{T})$ as opposed to the $O(1/T)$ rate possible for strongly
convex problems. 

Many optimization problems encountered in practice exhibit both features: the objective func- 
tion is strongly convex, and the optimum is sparse, or more generally, well-approximated by a 
sparse vector. This fact leads to the natural question: is it possible to design an algorithm for 
stochastic optimization that enjoys the best features of both types of structure? More specifically,
the algorithm should have an $O(1/T)$ convergence rate, but simultaneously enjoy the mild
logarithmic dependence on dimension. The main contribution of this paper is to answer this question
in the affirmative, and in particular, to analyze a new algorithm that has convergence rate
$O((s \log d)/T)$ for a strongly convex problem with an $s$-sparse optimum in $d$ dimensions. Moreover,
using information-theoretic techniques, we prove that this rate is unimprovable up to constant fac- 
tors, meaning that no algorithm can converge at a substantially faster rate. 

The algorithm proposed in this paper builds off recent work on multi-step methods for strongly 
convex problems [14], [12], [15], but involves some new ingredients that are essential to obtain optimal
rates for statistical problems with sparse optima. In particular, instead of performing updates on 
the same objective, we form a sequence of objective functions by decreasing the amount of regu- 
larization as the optimization algorithm proceeds. From a statistical viewpoint, this reduction is 
quite natural: at the initial stages, the algorithm has seen only a few samples, and so should be 
regularized more heavily, whereas at later stages when the effective sample size is much larger, the 
regularization should be appropriately attenuated. Each step of our algorithm can be computed 
efficiently with a closed form update rule in many common examples. In summary, the outcome of 
our development is an optimal one-pass algorithm for many structured statistical problems in high 
dimensions, and with computational complexity linear in the sample size. Numerical simulations 
confirm our theoretical predictions regarding the convergence rate of the algorithm, and also es- 
tablish its superiority compared to regularized dual averaging [30] and stochastic gradient descent 
algorithms. They also confirm that a direct application of the multi-step method of Juditsky and 
Nesterov [14] is inferior to our algorithm, meaning that our gradual decrease of regularization is
quite critical. In order to keep our presentation focused, we restrict our attention to multi-step 
variants of the dual averaging algorithm; however, it is worth noting that similar results can also 
be achieved for mirror descent as well as Nesterov's accelerated gradient methods [21] by combin- 
ing our results with recent work in the optimization literature [15]. Although this paper focuses 
exclusively on problems involving the recovery of a sparse vector, similar ideas can be extended
to other low-dimensional structures such as group-sparse vectors and low-rank matrices, as
discussed in the statistical context [18].

2 Problem set-up and algorithm description



Given a subset $\Omega \subseteq \mathbb{R}^d$ and a random variable $Z$ taking values in a space $\mathcal{Z}$, we consider an
optimization problem of the form

$$\theta^* \in \arg\min_{\theta \in \Omega} \mathbb{E}[\mathcal{L}(\theta; Z)], \qquad (1)$$

where $\mathcal{L} : \Omega \times \mathcal{Z} \to \mathbb{R}$ is a given loss function. As is standard in stochastic optimization, we do not
have direct access to the expected loss function $\bar{\mathcal{L}}(\theta) := \mathbb{E}[\mathcal{L}(\theta; Z)]$, nor to its subgradients. Rather,
for a given query point $\theta \in \Omega$, we observe a stochastic subgradient, meaning a random vector
$g(\theta) \in \mathbb{R}^d$ such that $\mathbb{E}[g(\theta)] \in \partial\bar{\mathcal{L}}(\theta)$. We then seek to approach the optimum of the population
objective $\bar{\mathcal{L}}$ using a sequence of these stochastic subgradients.

The goal of this paper is to design algorithms that are suitable for solving the problem (1) when
the optimum $\theta^*$ is sparse. In the simplest case, the optimum $\theta^*$ might be exactly $s$-sparse, meaning
that it has at most $s$ non-zero entries. Our analysis is actually much more general than this exact
sparsity setting, in that we provide oracle inequalities that apply to arbitrary vectors, and also
guarantee fast rates for vectors that are approximately sparse. More precisely, for any given subset
$S \subseteq \{1, \ldots, d\}$ of cardinality $|S| = s$, we provide upper bounds on the optimization error that scale
linearly with $s$, and also involve the residual term $\|\theta^*_{S^c}\|_1 := \sum_{j \in S^c} |\theta^*_j|$. For a general optimum
$\theta^*$, the best bound is obtained by choosing the subset $S$ appropriately so as to balance these two
contributions to the error.
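To make the residual term concrete, the following is a minimal sketch, not taken from the paper, that computes $\|\theta_{S^c}\|_1$ for a given vector. The helper names `best_support` and `residual_l1` and the example vector are hypothetical. For a fixed cardinality $s$, the residual is smallest when $S$ holds the $s$ largest-magnitude entries, which is the choice used here.

```python
import numpy as np

def best_support(theta, s):
    """Indices of the s largest-magnitude entries of theta (ties broken arbitrarily)."""
    return np.argsort(-np.abs(theta))[:s]

def residual_l1(theta, support):
    """l1-norm of the entries of theta outside `support`, i.e. ||theta_{S^c}||_1."""
    mask = np.ones(theta.shape[0], dtype=bool)
    mask[support] = False
    return float(np.sum(np.abs(theta[mask])))

# Hypothetical approximately sparse vector: three large entries, three tiny ones.
theta = np.array([3.0, -0.02, 0.0, 1.5, 0.01, -2.0])
for s in (1, 2, 3, 4):
    S = best_support(theta, s)
    print(s, sorted(S.tolist()), residual_l1(theta, S))
```

Increasing $s$ shrinks the residual but enlarges the term that scales linearly with $s$, which is the balance described above.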

2.1 Algorithm description 

In order to solve a sparse version of the problem (1), our strategy is to consider a sequence of
regularized problems of the form

$$\theta_\lambda = \arg\min_{\theta \in \Omega'} f_\lambda(\theta), \quad \text{where} \quad f_\lambda(\theta) := \bar{\mathcal{L}}(\theta) + \lambda\|\theta\|_1. \qquad (2)$$

Given a total number of iterations $T$, our algorithm involves a sequence of $K_T$ different epochs,
where the regularization parameter $\lambda > 0$ and the constraint set $\Omega' \subseteq \Omega$ change from epoch to
epoch. More precisely, the epochs are specified by:

• a sequence of natural numbers $\{T_i\}_{i=1}^{K_T}$, where $T_i$ specifies the length of the $i$-th epoch and
$\sum_{i=1}^{K_T} T_i = T$,

• a sequence of positive regularization weights $\{\lambda_i\}_{i=1}^{K_T}$, and

• a sequence of positive radii $\{R_i\}_{i=1}^{K_T}$ and $d$-dimensional vectors $\{y_i\}_{i=1}^{K_T}$, which specify the
constraint set

$$\Omega(R_i) := \{\theta \in \Omega \mid \|\theta - y_i\|_p \le R_i\} \qquad (3)$$

that is used throughout the $i$-th epoch.

We initialize the algorithm in the first epoch with $y_1 = 0$, and with any radius $R_1$ that is an
upper bound on $\|\theta^*\|_1$. The norm $\|\cdot\|_p$ used in defining the constraint set $\Omega(R_i)$ is specified by
$p = 2\log d/(2\log d - 1)$, where the rationale for this particular choice will be provided momentarily.

At a high level, the goal of the $i$-th epoch is to perform the update $y_i \mapsto y_{i+1}$, in such a way
that we are guaranteed that $\|y_{i+1} - \theta^*\|_p^2 \le R_{i+1}^2$ for each $i = 1, 2, \ldots$. We choose the radii so
as to decay geometrically as $R_{i+1}^2 = R_i^2/2$, so that upon termination of the $K_T$-th epoch, we have
$\|y_{K_T} - \theta^*\|_p^2 \le R_1^2/2^{K_T - 1}$. In order to perform the update $y_i \mapsto y_{i+1}$, we run $T_i$ rounds of the
stochastic dual averaging algorithm [22] on the regularized objective function

$$\min_{\theta \in \Omega(R_i)} \big\{ \bar{\mathcal{L}}(\theta) + \lambda_i\|\theta\|_1 \big\}. \qquad (4)$$

When applied to this objective function in the $i$-th epoch, the dual averaging algorithm operates
on stochastic subgradients of the cost function $\bar{\mathcal{L}}(\theta) + \lambda_i\|\theta\|_1$, and using a sequence of step sizes
$\{\alpha^t\}_{t=0}^{T_i}$, generates two sequences of vectors $\{\mu^t\}_{t=0}^{T_i}$ and $\{\theta^t\}_{t=0}^{T_i}$, initialized as $\mu^0 = 0$ and $\theta^0 = y_i$.
At iteration $t = 0, 1, \ldots, T_i$, we let $g^t$ be a stochastic subgradient of $\bar{\mathcal{L}}$ at $\theta^t$, and we let $\nu^t$ be any
element of the subdifferential of the $\ell_1$-norm $\|\cdot\|_1$ at $\theta^t$. Consequently, the vector $\mathbb{E}[g^t] + \lambda_i\nu^t$ is an
element of the subdifferential of $\bar{\mathcal{L}}(\theta) + \lambda_i\|\theta\|_1$ at $\theta^t$. The stochastic dual averaging update at time
$t$ performs the mapping $(\mu^t, \theta^t) \mapsto (\mu^{t+1}, \theta^{t+1})$ via the recursions

$$\mu^{t+1} = \mu^t + g^t + \lambda_i\nu^t, \quad \text{and} \qquad (5a)$$
$$\theta^{t+1} = \arg\min_{\theta \in \Omega(R_i)} \big\{ \alpha^{t+1}\langle \mu^{t+1}, \theta\rangle + \psi_{y_i,R_i}(\theta) \big\}, \qquad (5b)$$

where the prox function $\psi$ is specified below (6). The pseudocode describing the overall procedure
is given in Algorithm 1.

In the stochastic dual averaging updates (5), we use the prox function

$$\psi_{y_i,R_i}(\theta) = \frac{1}{2R_i^2(p-1)}\|\theta - y_i\|_p^2, \quad \text{where} \quad p = \frac{2\log d}{2\log d - 1}. \qquad (6)$$

This particular choice of the prox function and the specific value of $p$ ensure that the function $\psi$
is strongly convex with respect to the $\ell_1$-norm; it has been used previously for sparse stochastic
optimization (see e.g. [20], [27], [9]). In most of our examples, we consider the parameter space $\Omega = \mathbb{R}^d$,
and owing to our choice of the prox function and the feasible set in the update (5b), we can compute
$\theta^{t+1}$ from $\mu^{t+1}$ in closed form. See Appendix A for further details.
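One way to see why this value of $p$ is convenient is that the resulting $\ell_p$-norm stays within a constant factor of the $\ell_1$-norm in every dimension. By Hölder's inequality, $\|x\|_1 \le d^{1-1/p}\|x\|_p$, and with $p = 2\log d/(2\log d - 1)$ the factor $d^{1-1/p} = d^{1/(2\log d)}$ equals $\sqrt{e}$. The sketch below checks this numerically; the helper name `radar_p` and the random test vectors are illustrative assumptions.

```python
import numpy as np

def radar_p(d):
    """Exponent p = 2 log d / (2 log d - 1) used for the constraint set and prox function."""
    return 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)

rng = np.random.default_rng(0)
for d in (10, 1000, 100000):
    p = radar_p(d)
    x = rng.standard_normal(d)
    l1 = np.linalg.norm(x, 1)
    lp = np.linalg.norm(x, p)
    # The ratio lp / l1 always lies in [1/sqrt(e), 1], independently of d.
    print(d, round(p, 4), lp / l1, 1.0 / np.sqrt(np.e))
```

As $d$ grows, $p$ approaches one, yet the comparison constant between the two norms does not degrade.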

It is worth noting that our update rule, in taking the subgradient of the $\ell_1$-norm, is different from
previous approaches inspired by Nesterov's composite minimization strategy pT], which compute
a prox-mapping involving the $\ell_1$-norm [10], [30], [9]. Our results do extend in an obvious way to
computing such an exact composite prox-mapping. However, even when $\Omega = \mathbb{R}^d$, computing this
exact prox-mapping with the $\ell_p$-norm constraint in our update rule (5b) has a complexity $O(d^2)$,
which is prohibitive in high dimensions. In contrast, our update enjoys an $O(d)$ complexity.

2.2 Conditions 

Having defined our algorithm, we now discuss the conditions on the objective function $\bar{\mathcal{L}}(\theta) = \mathbb{E}[\mathcal{L}(\theta; Z)]$
and stochastic gradients that underlie our analysis. We begin with two conditions on the objective
function.

Algorithm 1 Regularization Annealed epoch Dual AveRaging (RADAR)

Require: Epoch length schedule $\{T_i\}_{i=1}^{K_T}$, initial radius $R_1$, step-size multiplier $\alpha$, prox-function
$\psi$, initial prox-center $y_1$, regularization parameters $\lambda_i$.
for Epoch $i = 1, 2, \ldots, K_T$ do
    Initialize $\mu^0 = 0$ and $\theta^0 = y_i$.
    for Iteration $t = 0, 1, \ldots, T_i - 1$ do
        Update $(\mu^t, \theta^t) \mapsto (\mu^{t+1}, \theta^{t+1})$ according to rule (5) with step size $\alpha^t = \alpha/\sqrt{t}$.
    end for
    Set $y_{i+1} = \frac{1}{T_i}\sum_{t=1}^{T_i} \theta^t$.
    Update $R_{i+1}^2 = R_i^2/2$.
end for
Return $y_{K_T+1}$
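The epoch structure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the authors' implementation: it takes $\Omega = \mathbb{R}^d$, sets each new prox-center to the average of the epoch's iterates, uses step size $\alpha/\sqrt{t+1}$ for the update producing $\theta^{t+1}$, and solves the constrained update (5b) with a closed form derived here (the paper's own derivation is in its Appendix A, which is not reproduced). The toy objective, noise level, epoch lengths and regularization weights are hypothetical values chosen only to exercise the code.

```python
import numpy as np

def lp_dual_averaging_step(mu, y, R, alpha, p):
    """Solve update (5b) for Omega = R^d:
    argmin over ||theta - y||_p <= R of alpha*<mu, theta> + ||theta - y||_p^2 / (2 R^2 (p - 1))."""
    q = p / (p - 1.0)                      # conjugate exponent, 1/p + 1/q = 1
    v = alpha * mu
    vmax = np.max(np.abs(v))
    if vmax == 0.0:
        return y.copy()
    w = np.abs(v) / vmax                   # rescale so that large powers do not overflow
    sum_wq = np.sum(w ** q)
    norm_q = vmax * sum_wq ** (1.0 / q)    # ||v||_q
    # Unit-l_p direction u minimizing <v, u>; it attains the value -||v||_q.
    direction = -np.sign(v) * w ** (q - 1.0) / sum_wq ** ((q - 1.0) / q)
    # Along that direction, minimize -r*||v||_q + r^2 / (2 R^2 (p - 1)) over 0 <= r <= R.
    r = min(norm_q * R ** 2 * (p - 1.0), R)
    return y + r * direction

def radar(grad_oracle, d, R1, epoch_lengths, lambdas, alpha, rng):
    """Epoch-based dual averaging with annealed l1 regularization (sketch of Algorithm 1)."""
    p = 2.0 * np.log(d) / (2.0 * np.log(d) - 1.0)
    y, R = np.zeros(d), float(R1)
    for T_i, lam in zip(epoch_lengths, lambdas):
        mu, theta, running_sum = np.zeros(d), y.copy(), np.zeros(d)
        for t in range(T_i):
            g = grad_oracle(theta, rng)        # stochastic subgradient of the expected loss
            nu = np.sign(theta)                # one valid subgradient of ||.||_1 at theta
            mu = mu + g + lam * nu             # update (5a)
            theta = lp_dual_averaging_step(mu, y, R, alpha / np.sqrt(t + 1), p)
            running_sum += theta
        y = running_sum / T_i                  # next prox-center: average of the epoch's iterates
        R = R / np.sqrt(2.0)                   # R_{i+1}^2 = R_i^2 / 2
    return y

# Hypothetical toy problem: expected loss 0.5*||theta - theta_star||_2^2 with Gaussian gradient noise.
d = 20
theta_star = np.zeros(d)
theta_star[0], theta_star[1] = 1.0, -1.0

def noisy_grad(theta, rng):
    return theta - theta_star + 0.05 * rng.standard_normal(d)

rng = np.random.default_rng(0)
estimate = radar(noisy_grad, d, R1=2.0, epoch_lengths=[100, 200, 400],
                 lambdas=[0.04, 0.02, 0.01], alpha=1.0, rng=rng)
error = np.linalg.norm(estimate - theta_star)
print("l2 error of the final prox-center:", error)
```

The closed form rests on two observations: for a fixed radius $r = \|\theta - y\|_p$, the linear term is minimized by a direction that does not depend on $r$, and the remaining one-dimensional problem in $r$ is a quadratic clipped to $[0, R]$. Each step therefore costs $O(d)$ operations.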



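Assumption 1 (Local Lipschitz condition). For each $R > 0$, there is a constant $G = G(R)$ such
that

$$|\bar{\mathcal{L}}(\theta) - \bar{\mathcal{L}}(\tilde\theta)| \le G\|\theta - \tilde\theta\|_1 \qquad (7)$$

for all $\theta, \tilde\theta \in \Omega$ such that $\|\theta - \theta^*\|_1 \le R$ and $\|\tilde\theta - \theta^*\|_1 \le R$.

For instance, this condition holds whenever $\|\nabla\bar{\mathcal{L}}(\theta)\|_\infty \le G$ for all $\theta$ such that $\|\theta\|_1 \le R$. In the
sequel, we provide several examples of loss functions whose gradients are bounded in this $\ell_\infty$-sense.

As mentioned, our goal is to obtain fast rates for objective functions that are (locally) strongly
convex. Accordingly, our next step is to provide a formal definition of this condition:

Assumption 2 (Local strong convexity (LSC)). The function $\bar{\mathcal{L}} : \Omega \to \mathbb{R}$ satisfies an $R$-local form
of strong convexity (LSC) if there is a non-negative constant $\gamma = \gamma(R)$ such that

$$\bar{\mathcal{L}}(\tilde\theta) \ge \bar{\mathcal{L}}(\theta) + \langle\nabla\bar{\mathcal{L}}(\theta), \tilde\theta - \theta\rangle + \frac{\gamma}{2}\|\theta - \tilde\theta\|_2^2 \qquad (8)$$

for any $\theta, \tilde\theta \in \Omega$ with $\|\theta\|_1 \le R$ and $\|\tilde\theta\|_1 \le R$.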

Some of our results concerning stochastic optimization for finite pools of examples are based upon a 
further weakening of the local strong convexity condition, which will be referred to as local restricted 
strong convexity (see equation (28)). Such conditions have been used before in both statistical and
optimization-theoretic analyses of sparse high-dimensional problems [2 El El El] • 

Our final condition concerns the mechanism that produces the stochastic subgradient $g(\theta)$ of
the cost function $\bar{\mathcal{L}}$ at $\theta \in \Omega$. It is a tail condition on the error $e(\theta) := g(\theta) - \mathbb{E}[g(\theta)]$.

Assumption 3 (Sub-Gaussian stochastic gradients). For all $\theta$ such that $\|\theta - \theta^*\|_1 \le R$, there is a
constant $\sigma = \sigma(R)$ such that

$$\mathbb{E}\big[\exp(\|e(\theta)\|_\infty^2/\sigma^2)\big] \le \exp(1). \qquad (9)$$

Clearly, this condition holds whenever the error vector $e(\theta)$ has bounded components. More generally,
the bound (9) holds with $\sigma^2 = O(\log d)$ whenever each component of the error vector has
sub-Gaussian tails (see footnote 1).

2.3 Some illustrative examples 

In this section, we describe some examples that satisfy the previously stated conditions. These 
examples also help to clarify how the parameters of interest can be computed or bounded in 
different applied scenarios. 

Example 1 (Classification under Lipschitz losses). In binary classification, the samples consist of
pairs $z = (x, y) \in \mathbb{R}^d \times \{-1, 1\}$. The vector $x$ represents a set of $d$ features or covariates, used
to predict the class label $y$. One way in which to predict the label is via a linear classifier, which
makes classification decisions according to the rule $x \mapsto \mathrm{sign}(\langle\theta, x\rangle)$. The vector of weights $\theta \in \mathbb{R}^d$
is estimated by minimizing an appropriately chosen loss function, of which many take the form
$\mathcal{L}(\theta; z) = \phi(y\langle\theta, x\rangle)$ for a function $\phi : \mathbb{R} \to \mathbb{R}_+$. Common choices include the hinge loss function

$$\phi_{\mathrm{hin}}(\alpha) := \max\{1 - \alpha, 0\} = (1 - \alpha)_+, \qquad (10)$$

which underlies the support vector machine, or the logistic loss function $\phi_{\log}(\alpha) = \log(1 + \exp(-\alpha))$.

Given a distribution $\mathbb{P}$ over $\mathcal{Z}$, which can be either the population distribution or the empirical
distribution over a finite sample, a common strategy is to draw $(x_t, y_t) \sim \mathbb{P}$ at iteration $t$, and use
the (stochastic) subgradient $g^t = \nabla\mathcal{L}(\theta; (x_t, y_t)) = x_t y_t \phi'(y_t\langle\theta, x_t\rangle)$. We now illustrate how the
assumptions of Section 2.2 are satisfied in this setting.

• Locally Lipschitz: In both of the above examples, the underlying function $\phi$ is actually
globally Lipschitz with parameter one. Thus, we have the bound

$$G \le \mathbb{E}\big[|\phi'(y\langle\theta, x\rangle)| \, \|x\|_\infty\big] \le \mathbb{E}\|x\|_\infty.$$

Often, the data satisfies the normalization $\|x\|_\infty \le B$, in which case we get $G \le B$. More generally,
we often have sub-Gaussian or sub-exponential tail conditions [6] on the marginal distribution
of each coordinate of $x$, in which case the same condition holds with $G = O(\sqrt{\log d})$
or $G = O(\log d)$ respectively.

• LSC: When the expectation (1) is under the population distribution, the above examples
satisfy the local strong convexity condition. Here we focus on the example of the logistic loss,
where $\psi(\alpha) := \phi''_{\log}(\alpha) = \exp(\alpha)/(1 + \exp(\alpha))^2$ is its second derivative.

Considering the case of zero-mean covariates, and letting $\sigma_{\min}(\Sigma)$ denote the minimum eigenvalue
of the covariance matrix $\Sigma = \mathbb{E}[xx^T]$, a second-order Taylor series expansion yields

$$\bar{\mathcal{L}}(\tilde\theta) - \bar{\mathcal{L}}(\theta) - \langle\nabla\bar{\mathcal{L}}(\theta), \tilde\theta - \theta\rangle = \frac{\psi(\langle\bar\theta, x\rangle)}{2}\|\Sigma^{1/2}(\theta - \tilde\theta)\|_2^2 \overset{(i)}{\ge} \frac{\psi(BR)\,\sigma_{\min}(\Sigma)}{2}\|\theta - \tilde\theta\|_2^2,$$

where $\bar\theta = a\theta + (1 - a)\tilde\theta$ for some $a \in (0, 1)$. Note that the lower bound (i) follows from
Hölder's inequality, that is, $|\langle\bar\theta, x\rangle| \le \|\bar\theta\|_1\|x\|_\infty \le RB$. Putting together the pieces, we
conclude that $\gamma \ge \psi(BR)\,\sigma_{\min}(\Sigma)$ in this example. When sampling from a finite pool, we
require an analogous condition, known as restricted strong convexity, to hold for the sample
version; see Section 3.1.3 for further details.

Footnote 1: A zero-mean random variable $Z$ is sub-Gaussian with parameter $\gamma$ if $\mathbb{E}[e^{tZ}] \le \exp(\gamma^2 t^2/2)$ for all $t \in \mathbb{R}$.

• Sub-Gaussian gradients: For covariates bounded in expectation, $\mathbb{E}\|x\|_\infty \le B$, this condition
is also relatively straightforward to verify. A simple calculation identical to the verification
of the Lipschitz condition above yields that

$$\|e(\theta)\|_\infty = \|\nabla\mathcal{L}(\theta; (x, y)) - \nabla\bar{\mathcal{L}}(\theta)\|_\infty \le \|\nabla\mathcal{L}(\theta; (x, y))\|_\infty + \|\nabla\bar{\mathcal{L}}(\theta)\|_\infty \le 2B.$$

Thus, by setting $\sigma^2 = (2B)^2$, we find that $\mathbb{E}\exp(\|e(\theta)\|_\infty^2/\sigma^2) \le \exp(1)$. If instead of a boundedness
condition, we assume that elements of the vector $x$ have sub-Gaussian tails, then
the condition will hold with $\sigma^2 = O(\log d)$, using standard results on sub-Gaussian maxima
(e.g., IS]).
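The stochastic subgradient $x\,y\,\phi'(y\langle\theta, x\rangle)$ from this example is short to write down. The sketch below, with hypothetical function names and randomly generated data, implements it for the hinge and logistic losses and checks the coordinate-wise bound $\|g\|_\infty \le \|x\|_\infty$ that follows from $|\phi'| \le 1$.

```python
import numpy as np

def hinge_subgradient(theta, x, y):
    """A subgradient of theta -> max(1 - y*<theta, x>, 0)."""
    margin = y * np.dot(theta, x)
    phi_prime = -1.0 if margin < 1.0 else 0.0      # derivative of (1 - a)_+ away from a = 1
    return phi_prime * y * x

def logistic_gradient(theta, x, y):
    """Gradient of theta -> log(1 + exp(-y*<theta, x>))."""
    margin = y * np.dot(theta, x)
    phi_prime = -np.exp(-np.logaddexp(0.0, margin))  # -1 / (1 + exp(margin)), computed stably
    return phi_prime * y * x

rng = np.random.default_rng(1)
d = 50
theta = rng.standard_normal(d)
x = rng.uniform(-1.0, 1.0, size=d)   # hypothetical covariates with ||x||_inf <= B = 1
y = 1.0
for name, g in (("hinge", hinge_subgradient(theta, x, y)),
                ("logistic", logistic_gradient(theta, x, y))):
    print(name, np.max(np.abs(g)), "<=", np.max(np.abs(x)))
```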

We now turn to the problem of least-squares regression. 

Example 2 (Least-squares regression). In the regression set-up, the samples are of the form
$z = (x, y) \in \mathbb{R}^d \times \mathbb{R}$, and the least-squares estimator is obtained by minimizing the quadratic loss
$\mathcal{L}(\theta; (x, y)) = (y - \langle\theta, x\rangle)^2/2$. To illustrate the conditions more clearly, let us suppose to start
(we relax this condition below) that our samples are generated according to a linear model

$$y = \langle x, \theta^*\rangle + w, \qquad (11)$$

where $w \sim N(0, \eta^2)$ is observation noise, and the covariate vectors $x$ are zero-mean with covariance
matrix $\Sigma = \mathbb{E}[xx^T]$. Under this condition, we have

$$\bar{\mathcal{L}}(\theta) = \mathbb{E}[\mathcal{L}(\theta; (x, y))] = \frac{1}{2}(\theta - \theta^*)^T\Sigma(\theta - \theta^*) + \frac{\eta^2}{2} = \frac{1}{2}\|\Sigma^{1/2}(\theta - \theta^*)\|_2^2 + \frac{\eta^2}{2}.$$

Consequently, the minimizer of $\bar{\mathcal{L}}$ is given by $\theta^*$, the true parameter in the linear model (11). We
now proceed to verify that our conditions from Section 2.2 are satisfied for this model.

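• Locally Lipschitz: For the quadratic loss, we no longer have a global Lipschitz condition;
instead, the local Lipschitz parameter $G(R)$ depends on the radius $R$, and the covariance
matrix $\Sigma$ via the quantity $\rho(\Sigma) := \max_j \Sigma_{jj}$. More specifically, we have

$$G(R) \le \|\Sigma(\theta - \theta^*)\|_\infty \overset{(a)}{\le} \max_{j,k}|\Sigma_{jk}| \, \|\theta - \theta^*\|_1 \overset{(b)}{\le} \rho(\Sigma)R,$$

where step (a) exploits Hölder's inequality, and inequality (b) follows since $\Sigma$ is a positive
semidefinite matrix.

• LSC: Again, let us consider the case when $\bar{\mathcal{L}}$ is defined via an expectation taken under the
population distribution. We then have

$$\bar{\mathcal{L}}(\tilde\theta) - \bar{\mathcal{L}}(\theta) - \langle\nabla\bar{\mathcal{L}}(\theta), \tilde\theta - \theta\rangle = \frac{1}{2}\|\Sigma^{1/2}(\theta - \tilde\theta)\|_2^2 \ge \frac{\sigma_{\min}(\Sigma)}{2}\|\theta - \tilde\theta\|_2^2,$$

so that $\gamma = \sigma_{\min}(\Sigma)$ in this example.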






• Sub-Gaussian gradients: Once again we assume that the design is bounded in $\ell_\infty$-norm,
that is $\|x\|_\infty \le B$. It can be shown that Assumption 3 is satisfied with

$$\sigma^2(R) = 24B^4R^2 + 36B^2\eta^2.$$

See Section 4.5 for details of this calculation.
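The Lipschitz and strong convexity constants for the population least-squares loss can be checked numerically for any positive semidefinite covariance. The sketch below uses a randomly generated covariance matrix and a random displacement $\theta - \theta^*$, both hypothetical, and verifies the two inequalities $\|\Sigma(\theta - \theta^*)\|_\infty \le \rho(\Sigma)\|\theta - \theta^*\|_1$ with $\rho(\Sigma) = \max_j \Sigma_{jj}$, and $\frac{1}{2}\|\Sigma^{1/2}(\theta - \theta^*)\|_2^2 \ge \frac{\sigma_{\min}(\Sigma)}{2}\|\theta - \theta^*\|_2^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 30
A = rng.standard_normal((d, d))
Sigma = A @ A.T / d                       # a positive semidefinite covariance (hypothetical)
rho = np.max(np.diag(Sigma))              # rho(Sigma) = max_j Sigma_jj
gamma = np.linalg.eigvalsh(Sigma)[0]      # smallest eigenvalue, sigma_min(Sigma)

delta = rng.standard_normal(d)            # plays the role of theta - theta*
grad = Sigma @ delta                      # gradient of the expected quadratic loss at theta

# Local Lipschitz bound: sup-norm of the gradient versus rho(Sigma) * l1 distance.
lipschitz_ok = np.max(np.abs(grad)) <= rho * np.sum(np.abs(delta)) + 1e-12
# Strong convexity: the second-order remainder versus (sigma_min / 2) * squared l2 distance.
remainder = 0.5 * delta @ Sigma @ delta
lsc_ok = remainder >= 0.5 * gamma * (delta @ delta) - 1e-12
print(lipschitz_ok, lsc_ok)
```

The first inequality uses the fact that a positive semidefinite matrix satisfies $|\Sigma_{jk}| \le \sqrt{\Sigma_{jj}\Sigma_{kk}}$, so its largest entry in absolute value sits on the diagonal.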

In practice, the linear model assumption (11) is not likely to hold exactly, but the validity of our
three conditions can still be established under reasonable tail conditions on the covariate-response
pair. In particular, it can be shown that the same local Lipschitz condition continues to hold with
$G(R) = 2\rho(\Sigma)R$, and the RSC condition also remains unchanged. In order to establish the sub-Gaussian
condition (Assumption 3), we need to make assumptions about the tail behavior of our
samples $(x_t, y_t)$. It suffices to assume that the distribution of the vectors $x$ and the conditional
distribution of $Y \mid X$ are both sub-Gaussian. Under these conditions, obtaining explicit bounds on $\sigma^2$
in terms of these sub-Gaussian parameters is analogous to our calculations above, and is omitted
here.



3 Main results and their consequences 

We are now in a position to state the main results of this work, regarding the convergence properties
of Algorithm 1. Below we present two sets of results. Our first result (Theorem 1) applies to
problems for which the Lipschitz and sub-Gaussianity assumptions hold over the entire parameter
set $\Omega$, and the RSC assumption holds uniformly for all $\theta$ satisfying $\|\theta\|_1 \le R_1$, where $R_1$ is the
initial radius. Examples include classification with globally Lipschitz losses, such as the hinge and
logistic losses discussed in Example 1. Our second result (Theorem 2) applies to least-squares
loss, which is not Lipschitz on $\mathbb{R}^d$, and requires a somewhat more delicate treatment. Both our
results follow from a common machinery, and build off of standard convergence results for the dual
averaging algorithm [22], [30]. For stating our results, we will assume that Assumptions 1 and 3 hold
with constants $G_i := G(R_i)$ and $\sigma_i := \sigma(R_i)$ at epoch $i$. Given a constant $\omega > 0$ governing the
probability of error in our results, we also define $\omega_i^2 = \omega^2 + 24\log i$ at epoch $i$. Both the theorems
below are based on the choice of epoch lengths

$$T_i := c_1\left[\frac{s^2}{\gamma^2R_i^2}\big((G_i^2 + \sigma_i^2)\log d + \omega_i^2\sigma_i^2\big) + \log d\right], \qquad (12)$$

where $c_1$ is a suitably chosen universal constant.
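To get a feel for the epoch schedule, the sketch below evaluates the epoch lengths under the assumption that rule (12) reads $T_i = c_1\big[\tfrac{s^2}{\gamma^2 R_i^2}((G^2 + \sigma^2)\log d + \omega_i^2\sigma^2) + \log d\big]$ with $R_i^2 = R_1^2/2^{i-1}$ and $\omega_i^2 = \omega^2 + 24\log i$, and with constants $G$ and $\sigma$ that do not depend on the epoch. The function name `epoch_lengths`, the default $c_1 = 1$ (the universal constant is not specified in the text) and the parameter values are illustrative assumptions.

```python
import math

def epoch_lengths(num_epochs, s, d, gamma, R1, G, sigma, omega, c1=1.0):
    """Epoch lengths T_1, ..., T_K under the assumed form of rule (12)."""
    lengths = []
    for i in range(1, num_epochs + 1):
        R_sq = R1 ** 2 / 2 ** (i - 1)                    # R_i^2, halved after every epoch
        omega_i_sq = omega ** 2 + 24.0 * math.log(i)     # omega_i^2 = omega^2 + 24 log i
        stat = (G ** 2 + sigma ** 2) * math.log(d) + omega_i_sq * sigma ** 2
        T_i = c1 * (s ** 2 / (gamma ** 2 * R_sq) * stat + math.log(d))
        lengths.append(math.ceil(T_i))
    return lengths

schedule = epoch_lengths(num_epochs=6, s=5, d=1000, gamma=1.0, R1=1.0,
                         G=1.0, sigma=1.0, omega=1.0)
print(schedule)
```

Because $R_i^2$ halves from one epoch to the next, the lengths roughly double, so the final epoch accounts for a constant fraction of the total iteration budget.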



3.1 Optimal rates for Lipschitz losses 

We begin with a setting quite standard in the optimization literature, in which the loss function 
is globally Lipschitz and the noise in our stochastic gradients is uniformly sub-Gaussian. More 
formally, we assume that there are constants $(G, \sigma)$ such that, independently of the choice of radius
$R$, Assumptions 1 and 3 hold with $G(R) = G$ and $\sigma(R) = \sigma$. There are many common examples
in machine learning where these assumptions are met, some of which are outlined in Example 1.
We also use $\gamma$ to denote the strong convexity constant $\gamma(R_1)$ in Assumption 2.

For a total of $T$ iterations in Algorithm 1, we state our results for the parameter $\theta_T = y_{K_T+1}$,
where we recall that $K_T$ is the total number of epochs completed in $T$ iterations.

3.1.1 Main theorem and some remarks



Our main theorem provides an upper bound on the convergence rate of our algorithm as a function
of the iteration number $T$, dimension $d$, strong convexity constant $\gamma$, Lipschitz constant $G$, and
some additional terms involving the sparsity of the optimal solution $\theta^*$. More precisely, for each
subset $S \subseteq \{1, 2, \ldots, d\}$ of cardinality $s$, we define the quantity

$$\varepsilon^2(\theta^*; S) := \frac{\|\theta^*_{S^c}\|_1^2}{s}, \qquad (13)$$

where $\|\theta^*_{S^c}\|_1 := \sum_{j \in S^c}|\theta^*_j|$ is the $\ell_1$-norm of terms outside $S$. The behavior of this
quantity can be used to measure the degree of sparsity in the optimum $\theta^*$. For instance, we have
$\varepsilon^2(\theta^*; S) = 0$ if and only if $\theta^*$ is supported on $S$. Given a constant $\omega > 0$, we also define the
shorthand

$$\kappa_T = \log_2\left[\frac{\gamma^2R_1^2T}{s^2\big((G^2 + \sigma^2)\log d + \omega^2\sigma^2\big)}\right]\log d. \qquad (14)$$

With this notation, we have the following result:



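Theorem 1. Suppose the expected loss $\bar{\mathcal{L}}$ satisfies Assumptions 1-3 with parameters $G(R) = G$,
$\gamma$ and $\sigma(R) = \sigma$ respectively, and we perform updates (5) with epoch lengths (12), and regularization/step
length parameters

$$\lambda_i^2 = \frac{R_i\gamma}{s\sqrt{T_i}}\sqrt{(G^2 + \sigma^2)\log d + \omega_i^2\sigma^2}, \quad \text{and} \quad \alpha^t = 5R_i\sqrt{\frac{\log d}{(G^2 + \lambda_i^2 + \sigma^2)\,t}}. \qquad (15)$$

Then for any subset $S \subseteq \{1, \ldots, d\}$ of cardinality $s$ and any $T \ge 2\kappa_T$, there is a universal constant
$c_0$ such that

$$\|\theta_T - \theta^*\|_2^2 \le c_0\left[\frac{s}{\gamma^2T}\left((G^2 + \sigma^2)\log d + \sigma^2\Big(\omega^2 + \log\frac{\kappa_T}{\log d}\Big)\right) + \varepsilon^2(\theta^*; S)\right] \qquad (16)$$

with probability at least $1 - 6\exp(-\omega^2/12)$.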



As with earlier work on multi-step methods for strongly convex objectives [14], [15], the theorem
predicts an overall convergence rate of $O(1/T)$; under our assumptions, this rate is known to be
the best possible [20]. Apart from this scaling, the other interesting factors in the bound are the
logarithmic scaling in the dimension $d$, and the trade-off between the two terms: the first of which
scales linearly in a chosen sparsity level $s$, and the second term $\varepsilon^2(\theta^*; S)$ represents a form of
approximation error. As a concrete instance, if the optimum $\theta^*$ is actually supported on a subset
$A \subseteq \{1, 2, \ldots, d\}$ of cardinality $s = |A|$, then choosing $S = A$ in the bound (16) yields an overall
convergence rate of $O\big(\frac{s\log d}{T}\big)$.

It is worthwhile comparing the convergence rate in Theorem 1 to alternative methods. A
standard approach to minimizing the objective (1) would be to perform stochastic gradient descent
directly on the objective, instead of considering our sequence of regularized problems (2). Under
the assumptions of Theorem 1, the expected loss is strongly convex with respect to the $\ell_2$-norm, so
that stochastic gradient descent (SGD) would converge at a rate $O((\tilde G^2 + \tilde\sigma^2)/(\gamma^2T))$, for constants
$\tilde G$ and $\tilde\sigma$ chosen to satisfy the bounds

$$\|\nabla\bar{\mathcal{L}}(\theta)\|_2 \le \tilde G, \quad \text{and} \quad \mathbb{E}\big[\exp(\|e(\theta)\|_2^2/\tilde\sigma^2)\big] \le \exp(1). \qquad (17)$$

Under the assumptions of Example 1, we find that it suffices to choose $\tilde G = B\sqrt{d}$ and similarly
for $\tilde\sigma$, so that SGD would converge at rate $O(d/T)$. This generic guarantee scales linearly in the
problem dimension $d$, and fails to exploit any sparsity inherent to the problem. The key difference
between this naive application of stochastic gradient descent and our approach is that since we
minimize a regularized objective (2), our iterates tend to be (approximately) sparse. As a result,
we have a form of local strong convexity not only in $\ell_2$-norm but also with respect to $\ell_1$-norm;
this is a key observation in exploiting sparsity and strong convexity at the same time. Another
standard approach is to perform mirror descent or dual averaging, using the same prox function as
Algorithm 1 but without breaking it up into epochs. As mentioned in the introduction, this vanilla
single-step method fails to exploit the strong convexity of our problem and obtains the inferior
convergence rate $O(s\sqrt{\log d/T})$.
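A quick numerical comparison shows how far apart these three scalings are in a high-dimensional regime. The sketch below evaluates $d/T$, $s\sqrt{(\log d)/T}$ and $(s\log d)/T$ with all problem-dependent constants ($\gamma$, $G$, $\sigma$, $B$ and the universal constants) set to one, which is an assumption made purely for illustration; the values of $d$, $s$ and $T$ are likewise hypothetical.

```python
import math

def rates(d, s, T):
    """Leading-order error scalings, with all constants set to one for illustration."""
    return {
        "sgd_strongly_convex": d / T,                         # exploits strong convexity only
        "single_step_sparse": s * math.sqrt(math.log(d) / T), # exploits sparsity only
        "radar": s * math.log(d) / T,                         # exploits both structures
    }

r = rates(d=100_000, s=10, T=1_000_000)
for name, value in r.items():
    print(f"{name:>22s}: {value:.3e}")
```

With these values the combined rate is smaller than either alternative by more than two orders of magnitude, although the comparison ignores constants and is only indicative.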

A different proposal, closer in spirit to our approach, is to minimize a similar regularized objective
(2), but with a fixed value of $\lambda$ instead of the decreasing schedule of $\lambda_i$ used in Theorem 1.
In fact, it can be obtained as a simple consequence of our proofs that setting $\lambda = \sigma\sqrt{\log d/T}$ leads
to an overall convergence rate of $O\big(\frac{s\log d}{\gamma^2T}\big)$, a result analogous to the guarantee of Theorem 1.
Indeed, this procedure can be understood as applying the algorithm of Juditsky and Nesterov [14]
to the problem (1), but where the bounds are obtained using the additional technical machinery
introduced in this paper. However, with this fixed setting of $\lambda$, the initial epochs tend to be much
longer than required for reducing the error by a factor of one half. Indeed, our setting of $\lambda_i$ is based
on minimizing the upper bound at each epoch, and leads to substantially improved performance
in our numerical simulations as well. The benefits of slowly decreasing the regularization in the
context of deterministic optimization were also noted in the recent work of Xiao and Zhang [31].

It is instructive to further simplify the bounds by making further assumptions, allowing us
to quantify these terms concretely. We do so in the following sections.



3.1.2 Some illustrative corollaries 



We start with a corollary for the setting where the optimum $\theta^*$ is supported on a subset $S$ of
cardinality $s$, where $s$ is a sparsity index between 1 and $d$. For these corollaries, so as to facilitate
comparison with minimax lower bounds, we use $\delta = \omega^2/\log d$ as the parameter in specifying the
high-probability guarantees. Under these conditions, our earlier notation $\kappa_T$ (14) further
simplifies to

$$\kappa_T = \log_2\left[\frac{\gamma^2R_1^2T}{s^2\log d\,(G^2 + \sigma^2(1 + \delta))}\right]\log d.$$

Within this setup, we have the following corollary of Theorem 1.

Corollary 1. Under the conditions of Theorem 1, assume further that $\theta^*$ takes non-zero values
only on a subset $S \subseteq \{1, \ldots, d\}$ of size $s$. Then for all $T \ge 2\kappa_T$, there is a universal constant $c_0$
such that

$$\|\theta_T - \theta^*\|_2^2 \le c_0\left[\big\{G^2 + \sigma^2(1 + \delta)\big\}\frac{s\log d}{\gamma^2T} + \frac{s\sigma^2}{\gamma^2T}\log\frac{\kappa_T}{\log d}\right] \qquad (18)$$

with probability at least $1 - 6\exp(-\delta\log d/12)$.

The corollary follows directly from Theorem 1 by noting that $\varepsilon^2(\theta^*; S) = 0$ under our assumptions.
It is useful to note that the results on recovery for generalized linear models presented here
match (up to $O(\log\log T)$ factors) those that have been developed in the statistics literature [18], [29],
which are optimal under the assumptions on the design vectors.

Theorem 1 also applies to the case when the optimum $\theta^*$ is not exactly sparse, but only approximately
so. Such notions of approximate sparsity can be formalized by enforcing a certain decay
rate on the magnitudes, when ordered from largest to smallest. Here we consider the notion of
$\ell_q$-"ball" sparsity: given a parameter $q \in (0, 1]$ and a radius $R_q$, consider the set of all vectors such
that

$$\mathbb{B}_q(R_q) := \Big\{\theta \in \mathbb{R}^d \;\Big|\; \sum_{j=1}^d |\theta_j|^q \le R_q\Big\}. \qquad (19)$$

For $q = 1$, this set reduces to an $\ell_1$-ball, whereas for $q \in (0, 1)$, it is a non-convex but star-shaped
set contained within the $\ell_1$-ball. With these assumptions, our earlier notation $\kappa_T$ further simplifies
to

$$\kappa_T = \log_2\left[\frac{R_1^2}{R_q^2}\left(\frac{\gamma^2T}{\log d\,(G^2 + \sigma^2(1 + \delta))}\right)^{1-q}\right]\log d.$$

The following corollary captures the convergence of our updates for such problems.

Corollary 2. Under the conditions of Theorem 1, suppose moreover that $\theta^* \in \mathbb{B}_q(R_q)$ for some
$q \in (0, 1]$. Then there is a universal constant $c_0$ such that for all $T \ge 2\kappa_T$, we have

$$\|\theta_T - \theta^*\|_2^2 \le c_0R_q\left[\left\{\frac{(G^2 + \sigma^2(1 + \delta))\log d}{\gamma^2T}\right\}^{1 - q/2} + \left(\frac{\sigma^2}{\gamma^2T}\right)^{1 - q/2}\frac{1}{((1 + \delta)\log d)^{q/2}}\log\frac{\kappa_T}{\log d}\right] \qquad (20)$$

with probability at least $1 - 6\exp(-\delta\log d/12)$.

Note that as $q$ ranges over the interval $[0, 1]$, reflecting the degree of sparsity, the convergence
rate ranges from $O(1/T)$ for $q = 0$, corresponding to exact sparsity, to $O(1/\sqrt{T})$ for $q = 1$. This is a
rather interesting trade-off, showing in a precise sense how convergence rates vary quantitatively as
a function of the underlying sparsity. While it might seem like the worsening of rates as $q$ increases
towards one defeats our original goal of obtaining fast rates by leveraging strong convexity of our
problem, this phenomenon is unavoidable due to existing lower bounds. More specifically, the
results on recovery for generalized linear models presented here exactly match those that have been
developed in the statistics literature [18], [29], which are optimal under our assumptions on the design
vectors. The reason for this phenomenon is that our goal of obtaining logarithmic dependence on
the dimension $d$ requires strong convexity of the objective with respect to the $\ell_1$-norm, while our LSC
assumption only guarantees strong convexity with respect to the $\ell_2$-norm. For a sparse optimum
$\theta^*$, the local strong convexity assumption also translates into the desired $\ell_1$-strong convexity, but
the constant deteriorates as $q$ increases from zero to one.
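The exponent $1 - q/2$ can be traced to a standard thresholding argument. The following derivation is a sketch rather than the paper's proof: it assumes the approximation term in Theorem 1 has the form $\|\theta^*_{S^c}\|_1^2/s$, ignores the rounding of $s$ to an integer, and introduces the shorthand $A$ and the threshold $\tau$ purely for illustration.

```latex
% Shorthand (introduced here): A := (G^2 + \sigma^2(1+\delta)) \log d,
% and for a threshold \tau > 0, S_\tau := \{ j : |\theta^*_j| > \tau \}.
%
% Step 1: membership in the \ell_q-ball controls both the size of S_\tau and the mass outside it.
\[
  |S_\tau| \;\le\; \sum_{j=1}^{d} \frac{|\theta^*_j|^q}{\tau^q} \;\le\; R_q \tau^{-q},
  \qquad
  \|\theta^*_{S_\tau^c}\|_1 \;=\; \sum_{j \notin S_\tau} |\theta^*_j|^{q}\, |\theta^*_j|^{1-q}
  \;\le\; R_q \tau^{1-q}.
\]
% Step 2: take s = R_q \tau^{-q} and any S \supseteq S_\tau with |S| = s. The two leading terms
% of the bound (estimation term and approximation term) then satisfy
\[
  \frac{s A}{\gamma^2 T} + \frac{\|\theta^*_{S^c}\|_1^2}{s}
  \;\le\; R_q \tau^{-q}\, \frac{A}{\gamma^2 T} + R_q \tau^{2-q}.
\]
% Step 3: the choice \tau^2 = A / (\gamma^2 T) balances the two terms, giving
\[
  \frac{s A}{\gamma^2 T} + \frac{\|\theta^*_{S^c}\|_1^2}{s}
  \;\le\; 2 R_q \left( \frac{A}{\gamma^2 T} \right)^{1 - q/2}.
\]
```

Setting $q = 1$ in the last display gives the $O(1/\sqrt{T})$ scaling, while letting $q \to 0$ recovers the $O(1/T)$ rate of the exactly sparse case.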

3.1.3 Stochastic optimization over finite pools 

A common setting for the application of stochastic optimization methods in machine learning is
when one has a finite pool of examples, say $\{z_1, \ldots, z_n\}$, and the goal is to compute

$$\theta^* \in \arg\min_{\theta \in \Omega} \Big\{ \frac{1}{n} \sum_{i=1}^n \mathcal{L}(\theta; z_i) \Big\}. \qquad (21)$$

In this setting, a stochastic gradient $g(\theta)$ can be obtained by drawing a sample $z_j$ at random with
replacement from the pool $\{z_1, \ldots, z_n\}$, and returning the gradient $\nabla \mathcal{L}(\theta; z_j)$, which is unbiased as
an estimate of the gradient of the sample average (21).
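A minimal sketch of this sampling scheme follows, assuming the logistic loss $\log(1 + \exp(-y\langle\theta, x\rangle))$ with labels in $\{-1, +1\}$. The function names, the pool size and the test point are illustrative choices, not taken from the paper.

```python
import math
import random


def logistic_grad(theta, x, y):
    """Gradient in theta of log(1 + exp(-y * <theta, x>)) for a label y in {-1, +1}."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    # weight = -y * sigmoid(-margin), computed without overflowing exp().
    if margin >= 0.0:
        e = math.exp(-margin)
        weight = -y * e / (1.0 + e)
    else:
        weight = -y / (1.0 + math.exp(margin))
    return [weight * xi for xi in x]


def pool_gradient(theta, pool):
    """Exact gradient of the sample average (1/n) * sum_i L(theta; z_i)."""
    total = [0.0] * len(theta)
    for x, y in pool:
        for k, g in enumerate(logistic_grad(theta, x, y)):
            total[k] += g
    return [v / len(pool) for v in total]


def stochastic_gradient(theta, pool, rng):
    """Draw one example uniformly at random (with replacement) and return its gradient."""
    x, y = pool[rng.randrange(len(pool))]
    return logistic_grad(theta, x, y)


rng = random.Random(0)
d, n = 5, 20
pool = [([rng.uniform(-1.0, 1.0) for _ in range(d)], rng.choice([-1, 1])) for _ in range(n)]
theta = [0.1] * d

draws = 20000
mean = [0.0] * d
for _ in range(draws):
    for k, g in enumerate(stochastic_gradient(theta, pool, rng)):
        mean[k] += g / draws
full = pool_gradient(theta, pool)
print(max(abs(a - b) for a, b in zip(mean, full)))  # shrinks as `draws` grows
```

Averaging many draws approaches the full-pool gradient, which is the unbiasedness property used in the text; each individual draw costs one example rather than $n$.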

In many applications, the dimension $d$ is substantially larger than the sample size $n$, in which
case the sample loss defined above can never be strongly convex. However, it can be shown [23], [18]
that under a suitable condition, the sample objective (21) does satisfy a suitably restricted form of
the LSC condition formalized in Assumption 2, one that is valid even when $d \gg n$. As a result, the
generalized form of Theorem 1 we provide in Section 4.3 does apply to this setting as well, and we
can obtain the following corollary. We will present this result only for settings where $\theta^*$ is exactly
sparse; the extension to approximate sparsity is identical to the above discussion for obtaining
Corollary 2 from Corollary 1. Moreover, we also specialize to the logistic loss

$$\mathcal{L}(\theta; (x, y)) := \log(1 + \exp(-y \langle \theta, x \rangle)), \qquad (22)$$

which suffices to illustrate the main aspects of the result. We also introduce the shorthand
$\psi(\alpha) = \exp(\alpha)/(1 + \exp(\alpha))^2$, corresponding to the second derivative of the logistic function.
Before stating the corollary, we state a condition on the design that is needed to ensure the RSC
condition. The condition is stated on the design matrix $X \in \mathbb{R}^{n \times d}$ with $x_i^T$ as its $i$-th row.

Assumption 4 (Sub-Gaussian design). The design matrix $X$ is sub-Gaussian with parameters
$(\Sigma, \eta_x^2)$ if

(a) Each row $x_i \in \mathbb{R}^d$ is sampled independently from a zero-mean distribution with covariance
$\Sigma$, and

(b) For any unit-norm vector $u \in \mathbb{R}^d$, the random variable $\langle u, x_i \rangle$ is sub-Gaussian with parameter
$\eta_x$, meaning that $\mathbb{E}[\exp(t \langle u, x_i \rangle)] \le \exp(t^2 \eta_x^2/2)$ for all $t \in \mathbb{R}$.
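The shorthand for the logistic curvature introduced before this assumption can be checked numerically. The sketch below assumes that "the logistic function" refers to $a \mapsto \log(1 + \exp(a))$; the helper names are hypothetical. It also illustrates why the curvature at the edge of an interval $[-c, c]$ is the smallest curvature inside it, which is how a quantity such as the curvature at $2BR_1$ enters as a strong-convexity constant.

```python
import math


def softplus(a):
    """log(1 + exp(a)), evaluated without overflow."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def psi(a):
    """exp(a) / (1 + exp(a))^2; the function is even, so only |a| is needed."""
    e = math.exp(-abs(a))
    return e / (1.0 + e) ** 2


def second_difference(f, a, h=1e-3):
    """Central finite-difference estimate of f''(a)."""
    return (f(a + h) - 2.0 * f(a) + f(a - h)) / (h * h)


a = 0.7
print(psi(a), second_difference(softplus, a))   # the two values agree closely

c = 3.0   # stands in for 2 * B * R_1
grid = [-c + 2.0 * c * k / 200 for k in range(201)]
print(min(psi(g) for g in grid), psi(c))        # the minimum over [-c, c] sits at the edge
```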



In this setup, our definition of $\kappa_T$ (14) is modified to

$$\kappa_T = \log_2 \left[ \frac{\sigma_{\min}^2(\Sigma)\, \psi^2(2BR_1)\, R_1^2\, T}{s^2 B^2 (5 + 4\delta) \log d} \right] \log d.$$

We can now state a convergence result for this setup.

Corollary 3 (Logistic regression for finite pools). Consider the finite-pool loss (21), based on $n$ i.i.d.
samples from a sub-Gaussian design with parameters $(\Sigma, \eta_x^2)$. Suppose further that Assumptions 1
and 3 are satisfied and the optimum $\theta^*$ of the problem (21) is $s$-sparse. Then there are universal
constants (co, c±, C2, C3) such that for all T > 2kt and n > C3 ] og £,, max((j^ lin (E), rj^.), we have 

^min \ / 

ft - HH < £^{_L_ {B2(1 + „, } + C o <|n(s -; Mi)r (23, 

with probability at least $1 - 2\exp(-c_1 n \min(\sigma_{\min}^2(\Sigma)/\eta_x^4, 1)) - 6\exp(-\delta \log d/12)$.

Once again we observe optimal dependence on the quantities $s$, $\log d$, $T$ and $\sigma_{\min}(\Sigma)$ in our
convergence rate. For the purposes of optimization, a dependence on the strong convexity of the
loss through $1/\psi^2(2BR_1)$ also seems unavoidable. Indeed, the lower bound of Agarwal et al. [1] for
the complexity of stochastic convex optimization with strongly convex objectives implies that such
a scaling is necessary for any stochastic first-order method. While the result in their Theorem 2 is
stated in terms of $\mathcal{L}(\theta_T) - \mathcal{L}(\theta^*)$ and posits a $1/\gamma$ scaling, it can be easily extended to also imply
a $1/\gamma^2$ scaling for the error $\|\theta_T - \theta^*\|_2^2$. Finally, we observe that the bound only holds once the
number of samples $n$ in the objective (21) is large enough. This arises since the sample objective
is not strongly convex by itself, but does satisfy a restricted version of the LSC condition once the
sample size is large enough. These ideas are further clarified in the proof of the corollary that we
present in Section 4.4.2.



3.2 Optimal rates for least squares regression 

In this section, we specialize to the case of least squares regression described previously in
Example 2. For ease of presentation, we further assume that the linear model assumption (11) holds.
Since the least-squares cost function is not Lipschitz over the entire set $\Omega$, we need the general
local setting of our assumptions. For brevity, we introduce the shorthand notation $G_i = G(2R_i)$
and $\sigma_i = \sigma(2R_i)$, and note that all of these parameters now depend on the epoch $i$.

The following theorem characterizes the convergence rate of Algorithm 1 for least-squares
regression, when applied to independently and identically distributed (i.i.d.) samples generated from the
linear model (11) with $B$-bounded covariates (i.e., $\|x\|_\infty \le B$ with probability one), and additive
Gaussian noise with variance $\eta^2$. For this example, we (re)define



s 2 B A + 1 2 , 



V 



00 +logd)log 2 



s 2 t] 2 B 2 (uj 2 + log d) 



(24) 



In stating the result, we make use of the shorthand 



£r := log 



K T 7 



(s 2 5 4 + 7 2 )(w 2 + logd) 



Theorem 2. Consider the updates (5) with epoch lengths (12) and regularization/stepsize parameters



s\/Ti 



(G 2 + a 2 ) log d + 00 2 a 2 and ot = 5R, 



log d 



(G 2 + Xj + a 2 )t 



(25) 



Then there is a universal constant $c_0$ such that for any $T > 2\kappa_T$ and for any subset $S$ of $\{1, \ldots, d\}$
of cardinality $s$, we have



\\6t 



3*112 



< 



CO 



<in(S) 



T 



u 2 + logd + ^ T )+e 2 (9*;S) 



with probability at least $1 - 6\exp(-\omega^2/12)$.

Once again, if we focus only on the scaling with iteration number $T$, the above theorem
gives an overall convergence rate of $O(1/T)$. The dominant term in the above bound scales as
O (y jl B ^ sl ^ d ^ ■ In a stochastic optimization setting where each stochastic gradient is based on 
drawing one fresh sample from the underlying distribution, the number of iterations $T$ also
corresponds to the number of samples seen. In such a scenario, the above iteration complexity bound
is unimprovable in general due to matching sample complexity lower bounds for the sparse linear
regression problem [21]. This optimality is further clarified in the corollaries that we present below
for the exact and approximately sparse cases. The corollaries are analogous to our earlier results in
Corollaries 1 and 2 for the case of Lipschitz losses.

Corollary 4. Under the conditions of Theorem 2, we have the following guarantees.

(a) Exact sparsity: Suppose that $\theta^*$ is supported on a subset of cardinality $s$. Then there is a
universal constant $c_0$ such that for all $T > 2\kappa_T$, we have



slogd n 2 B 2 sifB 2 
-TP, 2 /v\ 1 + 6 + c ° rr 2 

T <in( S ) T <in( S ) 



\\0t ~ 0*M < co -^-7^(1 + S) + cq - -- (26) 



with probability at least $1 - 6\exp(-\delta \log d/12)$.

(b) Weak sparsity: Suppose $\theta^* \in \mathbb{B}_q(R_q)$ for some $q \in (0, 1]$. Then there is a universal constant
$c_0$ such that for all $T > 2\kappa_T$, we have

fe-Hi<«*.{ f*;'° gd(i+ / )+fr r /2 <2T) 

with probability at least $1 - 6\exp(-\delta \log d/12)$.



Part (a) of the corollary follows from observing that $\varepsilon^2(\theta^*; S) = 0$ in the result of Theorem 2
under our assumptions here. Part (b) involves setting $S$ based on the assumption $\theta^* \in \mathbb{B}_q(R_q)$,
analogous to the proof of Corollary 2.

A corollary analogous to Corollary 3 can also be obtained from Theorem 2. This involves
replacing the use of the RSC assumption for the sample-averaged objective as before, and we leave
such a development to the reader.



3.3 A modified method with constant epoch lengths 

Algorithm 1 as described is efficient and simple to implement. However, the convergence results
are based on epoch lengths $T_i$ set in an appropriate "doubling" manner. In practice, this setting
might be difficult to achieve, since it is not immediately clear how to set the epoch lengths
unless all of the problem parameters are provided. Juditsky and Nesterov [14] address this issue
by proposing an algorithm that uses fixed epoch lengths, and is additionally robust to imperfect
knowledge of problem parameters such as the strong convexity and Lipschitz constants. In this
section, we discuss how a similar approach with fixed epoch lengths also works in our set-up. At a
coarse level, if we have a total budget of $T$ iterations, then this version of our algorithm allows us
to set the epoch lengths to $O(\log T)$, and guarantees convergence rates that are $O((\log T)/T)$, so
at most a log factor worse than our earlier results. We note that unlike past work, our objective
function changes at each epoch, which leads to certain new technical difficulties.

For ease of presentation in stating a fixed-epoch length result, we assume $\tau = 0$ and $\bar\gamma = \gamma$
throughout this section. We further restrict ourselves to the setting of Theorem 1 with $G_i \equiv G$
and $\sigma_i \equiv \sigma$, with the extension to the least-squares case analogous to that used to obtain Theorem 2.

Theorem 3. Suppose the expected loss satisfies Assumptions 1-3 with parameters $G_i \equiv G$, $\gamma$ and
$\sigma_i \equiv \sigma$ respectively. Recalling the setting (14) of $\kappa_T$, suppose we run Algorithm 1 for a total of $T$
iterations with epoch length $T_i \equiv T \log d/\kappa_T$, and with parameter settings (15). Assuming that the
above setting ensures that $T_i = O(\log d)$, for any subset $S \subseteq \{1, \ldots, d\}$ of cardinality at most $s$,

||0 T -0*||1 = C? ^- ( G2 + CT2 ) lo g d +( w2 + lo g(^/ log d))cr 2 logd 



with probability at least $1 - 3\exp(-\omega^2/12)$.

The theorem shows that, up to logarithmic factors in $T$, setting the epoch lengths optimally
is not critical, which is an important practical consideration. We note that a similar result can also be
proved for the case of least-squares regression.



4 Proofs of main results 

We now turn to the proofs of our main results, which are all based on a proposition that characterizes
the convergence rate of the updates (5) within each epoch. Proving this intermediate
result requires combining the standard analysis of the dual averaging algorithm with the statistical
properties of the minimizer of the epoch objective $f_i$ at each epoch. We then build on this basic
convergence result using an iterative argument in order to establish our main Theorems 1 and 2.



4.1 Set-up and a general result 

In our proofs, we use a weaker form of the local strong convexity (LSC) condition, known as locally
restricted strong convexity, or local RSC. This weakened condition allows us to adapt our proofs to
finite pool optimization (Corollary 3) in a seamless way, and also to establish slightly more general
versions of our main results:



Assumption 2' (Locally restricted strong convexity). The function $\mathcal{L} : \Omega \to \mathbb{R}$ satisfies an $R$-local
form of restricted strong convexity (RSC) if there are non-negative constants $(\gamma, \tau)$ such that

$$\mathcal{L}(\tilde\theta) \ge \mathcal{L}(\theta) + \langle \nabla \mathcal{L}(\theta), \tilde\theta - \theta \rangle + \frac{\gamma}{2} \|\theta - \tilde\theta\|_2^2 - \tau \|\theta - \tilde\theta\|_1^2 \qquad (28)$$

for any $\theta, \tilde\theta \in \Omega$ with $\|\theta\|_1 \le R$ and $\|\tilde\theta\|_1 \le R$.

Note that this condition reduces to the standard form of local strong convexity in Assumption 2
when $\tau = 0$. The key weakening here is the presence of the additional tolerance term, namely the
quantity $-\tau \|\theta - \tilde\theta\|_1^2$. Due to this term, the lower bound (28) provides a nontrivial constraint only
for pairs of vectors $\theta, \tilde\theta$ such that $\frac{\gamma}{2} \|\theta - \tilde\theta\|_2^2 \gg \tau \|\theta - \tilde\theta\|_1^2$. Since the ratio of the $\ell_1$ and $\ell_2$ norms is a
measure of sparsity, the local RSC condition thus enforces local strong convexity only in directions
that are relatively sparse. As a concrete example, if the difference $\theta - \tilde\theta$ is $s$-sparse, then we have
$\|\theta - \tilde\theta\|_1^2 \le s \|\theta - \tilde\theta\|_2^2$, so that the condition (28) guarantees that

$$\mathcal{L}(\tilde\theta) - \big\{ \mathcal{L}(\theta) + \langle \nabla \mathcal{L}(\theta), \tilde\theta - \theta \rangle \big\} \ge \frac{1}{2} \{\gamma - 2s\tau\} \|\theta - \tilde\theta\|_2^2, \qquad (29)$$

a non-trivial statement whenever $\gamma > 2s\tau$.
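The step from the tolerance term to a plain strong-convexity bound for sparse differences rests on comparing the squared $\ell_1$ norm of an $s$-sparse vector with $s$ times its squared $\ell_2$ norm. A small numerical sketch of that step follows; the values of `gamma`, `tau`, `d` and `s` are arbitrary illustrations, not quantities from the paper.

```python
import random


def l1(v):
    return sum(abs(a) for a in v)


def l2_squared(v):
    return sum(a * a for a in v)


def rsc_lower_bound(gamma, tau, v):
    """The curvature term gamma/2 * ||v||_2^2 - tau * ||v||_1^2 from the RSC condition."""
    return 0.5 * gamma * l2_squared(v) - tau * l1(v) ** 2


rng = random.Random(0)
d, s = 50, 5
v = [0.0] * d
for j in rng.sample(range(d), s):      # an s-sparse difference vector
    v[j] = rng.uniform(-1.0, 1.0)

gamma, tau = 1.0, 0.05                 # chosen so that gamma > 2 * s * tau
print(l1(v) ** 2, s * l2_squared(v))   # first value never exceeds the second
print(rsc_lower_bound(gamma, tau, v), 0.5 * (gamma - 2 * s * tau) * l2_squared(v))
```

For a dense vector the same tolerance term can swamp the quadratic term, which is why the condition constrains only relatively sparse directions.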

With this intuition, for applications of the condition (28) with $\tau \ne 0$, we introduce the effective
RSC constant

$$\bar\gamma = \gamma - 16 s \tau, \qquad (30)$$

where we have introduced the factor of 16 for later theoretical convenience. In addition, we use a
slightly generalized definition of the approximation-error term $\varepsilon^2(\theta^*; S)$, namely



e 2 (9*;S,t) :-- 



3* l|2 

Vlli 



1+= 

7 



(31) 



which reduces to $\varepsilon^2(\theta^*; S)$ when $\tau = 0$. So as to simplify notation, we use $f_i(\theta) := \mathcal{L}(\theta) + \lambda_i \|\theta\|_1$
to denote the objective at epoch $i$. Following standard notation in the optimization literature,
we also require a quantity $A_\psi$ such that $A_\psi \ge \psi(\theta)$ for all $\|\theta\|_1 \le 1$; in our case, the choice of
prox-function (6) ensures that $A_\psi = e \log d$ suffices. We also recall our notation $\omega_i^2 = \omega^2 + 24 \log i$.

We now state and prove a slightly generalized form of Theorem 1 that allows for $\tau > 0$. It is
based on the epoch lengths



ci 



^(^(G 2 + a 2 ) + ^V) + 



(32) 



where $c_1$ is a universal constant. The more general form of Theorem 1 also involves the quantity



k t := log 2 



^R 2 T 



7 2 s 2 ((G 2 + a 2 )A^+uj 2 a 2 ) 



jlogd 

7 



(33) 



It applies to the dual averaging updates (5) with the epoch lengths (32) and regularization/stepsize
parameters



A 



sJTi 



A^(G 2 + a 2 ) + w 2 a 2 , and 0/ 



- 1 / 



{G 2 + \ 2 + a 2 )t 



(34) 



Theorem 4. Suppose the expected loss $\mathcal{L}$ satisfies Assumptions 1, 2' and 3 with parameters
$G(R) \equiv G$, $(\gamma, \tau)$, and $\sigma(R) \equiv \sigma$ respectively, and that we run Algorithm 1 with parameter
settings (34) and epoch lengths (32). Then there is a universal constant $c_0$ such that for any $T > 2\kappa_T$,
for any integer $s \in [1, d]$ such that $\gamma - 16 s \tau > 0$, and for any subset $S \subseteq \{1, \ldots, d\}$ of cardinality
$s$, we have



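$$\|\theta_T - \theta^*\|_2^2 \le c_0 \left[ \frac{s}{\bar\gamma^2 T} \Big( G^2 \log d + \sigma^2 (\log d + \omega^2 + \log \kappa_T) \Big) + \varepsilon^2(\theta^*; S, \tau) \right] \qquad (35)$$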



with probability at least $1 - 6\exp(-\omega^2/12)$.



In order to prove this theorem, we require some intermediate results on the convergence rates within
each epoch. We state these results here, deferring their proofs to the appendices, before returning
to prove Theorem 1 and its corollaries.

4.2 Convergence within a single epoch



This intermediate result applies to iterates generated using the dual averaging updates (5) for $T_i$
rounds with parameters (34), where $G = G_i$, and the error bound is stated in terms of the averaged
parameter at the $i$-th epoch, namely the vector $\bar\theta(T_i) := \frac{1}{T_i} \sum_{t=1}^{T_i} \theta^t$.

Proposition 1. Suppose $\mathcal{L}$ satisfies Assumptions 1, 2' and 3 with parameters $G_i$, $(\gamma, \tau)$ and $\sigma_i$
respectively, and assume that $\|\theta^* - y_i\|_p \le R_i$. Suppose we apply the updates (5) with stepsizes
based on equation (34). Then there exists a universal constant $c > 0$ such that for any radius
Rf > 47£ 2 (#*; S, t)/t, any integer s G [1, d] such that 7 — 16sr > 0, and any subset S C {1, . . . , d} 
of cardinality at most s, we have 



fMTi)) - m) < 30Ri 



Ti 



+ 



A. 

+ 30Ri\i\l — — and 



r\\\<cl 

7 



sRj 



RfA^p 



+ e'(9*;S,r) 



(36a) 
(36b) 



where both bounds are valid with probability at least $1 - 3\exp(-\omega_i^2/12)$ for any $\omega_i \le 9\sqrt{\log T_i}$.

On one hand, inequality (36a) is a relatively direct consequence of known convergence results
about stochastic dual averaging [22]. On the other hand, the bound (36b), which plays a central
role in our proofs, requires some additional statistical properties of the optimal solution at each
epoch $i$. See Appendix B for further details on these properties, and the proof of Proposition 1.

Before moving on, we note that the bounds in Proposition 1 can be simplified further based on
the parameter settings in equations (32) and (34). Substituting these choices in our bounds yields
the inequalities



fi(9(Ti)) - fS) < c^ttl and 



HI? < c 



Ri +^se 2 (9*;S,T) 



(37a) 
(37b) 



In addition to this proposition, we need to state two more technical lemmas, the first of which
bounds the error $\Delta_i := \hat\theta_i - \theta^*$.

Lemma 1. At epoch $i$, assume that $\|\theta^* - y_i\|_p \le R_i$. Then the error $\Delta_i = \hat\theta_i - \theta^*$ satisfies the
bounds


I2 < =VsXi + 2 

7 



|i < =s\i + 4 1 





|i +4r|| 


9* II 2 


V 7 


ls(\i\\9* c 


|i +4r|| 


/)* ||2\ 



and 



7 



+ 2I 



(38a) 
(38b) 



For future reference, it is convenient to note that the bound (38b) implies that



9 

7 



(39)

where we have made use of the elementary inequalities $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and $2\sqrt{ab} \le a + b$, valid
for all non-negative $a, b$.

Our next lemma provides a simplified version of the RSC condition that holds under the conditions
of Proposition 1. The lemma is stated in terms of the error

$$\Delta(T_i) := \bar\theta(T_i) - \hat\theta_i \qquad (40)$$

between the average $\bar\theta(T_i) := \frac{1}{T_i} \sum_{t=1}^{T_i} \theta^t$ over trials 1 through $T_i$ in epoch $i$, and the epoch optimum $\hat\theta_i$.

Lemma 2. Under the conditions of Proposition 1, and with parameter settings (32) and (34), we have

^WHT^ < - fSi) + cr + se 2 (0* ; 5, r 

with probability at least $1 - 3\exp(-\omega_i^2/12)$.

4.3 Proof of Theorem [4] 

We are now equipped to prove Theorem 4. The proof will be broken down into cases, corresponding
to whether $T$ is "too large" or not. We recall that $K_T$ is the total number of epochs performed
after $T$ steps in Algorithm 1.

We first consider iterations for which the bound



R 2 K T >%e 2 (6*;S,T) (41) 

7 



is satisfied. We then provide an additional lemma which allows us to control the iterates for epochs
$i$ after which the squared radius $R_i^2$ violates the bound (41).

4.3.1 Proof assuming inequality (41) holds

Our first step is to ensure that the bound $\|\theta^* - y_i\|_p \le R_i$ holds at each epoch $i$, so that Proposition 1
can be applied in a recursive manner. We prove this intermediate claim by induction on the epoch
index. By construction, this bound holds at the first epoch. Assume that it holds for epoch $i$.
Recall that the epoch length setting in Theorem 4 is of the form



T,, = C 



^{A 4 ,{G 2 + a 2 ) + u 2 a 2 ) + ^ 
~ 1 R { 7 



where $C > 1$ is a constant that we are free to choose. Upon substituting this setting of $T_i$ in the
inequality (36b), the simplified bound (37b) further yields

||*Czi) - *l? < ^ (i?f + ^ 2 (0*;s,r)) <^Rl 

where step (i) follows due to our assumption R 2 > 4"fse 2 (6*; S,r)/^f. Thus, by choosing C suffi- 
ciently large, we may ensure that \\6{Ti) - 0*||? < R 2 /2 := R 2 i+l . Consequently, if 6* is feasible at 
epoch $i$, it stays feasible at epoch $i + 1$, and so by induction we are guaranteed the feasibility of
$\theta^*$ throughout the run of the algorithm.

As a result, Lemma 2 applies, and we find that



1 



WHTiM < fMTi)) - m) + cr ( ±B* + se z (9*; S,r 



7 D 2 



with probability at least $1 - 3\exp(-\omega_i^2/12)$. Appealing to the simplified form (37a) of Proposition 1,
we can further obtain the inequality



1 



2^2 



j-\\HTi)\\i <c^ + ct(1r* + S£ . {9 * s t) 
I s-f \7 

Recalling that $\gamma = \bar\gamma + 16 s \tau$, the above bound further simplifies to



l\\A(m 2 2 <c 



+ sTe 2 (9*;S,T 



(42) 



We have now bounded $\Delta(T_i) = \bar\theta(T_i) - \hat\theta_i$, and Lemma 1 provides a bound on $\Delta_i = \hat\theta_i - \theta^*$,
so that the error $\Delta^*(T_i) = \bar\theta(T_i) - \theta^*$ can be controlled using the triangle inequality. In particular,
combining inequality (38a) with the bound (42), we find that



E[ + t_s£ 2 (9*;S,t) t sAf , A, ||0£ c ||i + t\\9* So \\1 
s 



1 



-I — 4. 

^ ^t2 ~ 



7 



1 



Since 2Ai||0£ c ||i < ^ + 
bound to obtain 



7ll y cc||l 



by the Cauchy-Schwarz inequality, we can further simplify the above



|A*(r0l|l<c 



R 2 | tse 2 {9*-S,t) , sAf , |{gy| 
s 



7 



TS 
1 + — 

7 



7 7 s 

Substituting our choice of $\lambda_i$ and $T_i$ from equations (34) and (32) respectively yields the final bound

r i? 2 



\&*(Ti)\\i<c 



•S'T 



7 



a bound that holds with probability at least $1 - 3\exp(-\omega_i^2/12)$. Recalling that $R_i^2 = R_1^2 2^{-(i-1)}$, we see
that the error after $i$ epochs is at most



|A*(T,)i<c 



R 2 2 -(i-l) 



T 2(a*. 



S,T) 1 + 



ST 



1 



Since $\gamma = \bar\gamma + 16 s \tau$, some algebra then leads to

■|j2 2 -(i-l) 



|A*(T<)||i<c 



7 



s 7 



(43) 



with probability at least $1 - 3\sum_{j=1}^{i} \exp(-\omega_j^2/12)$. Recalling our setting $\omega_i^2 = \omega^2 + 24 \log i$, we can
apply the union bound and simplify the error probability as

$$\sum_{j=1}^{i} \exp(-\omega_j^2/12) = \sum_{j=1}^{i} \exp(-(\omega^2 + 24 \log j)/12) = \exp(-\omega^2/12) \sum_{j=1}^{i} \frac{1}{j^2}.$$

As a result, we can upper bound the net probability of our bounds failing after the $K_T$ epochs
performed by Algorithm 1 as

$$\sum_{j=1}^{K_T} 3\exp(-\omega_j^2/12) \le \sum_{j=1}^{\infty} 3\exp(-\omega_j^2/12) \le 3\exp(-\omega^2/12) \sum_{j=1}^{\infty} \frac{1}{j^2} \le \frac{\pi^2}{2} \exp(-\omega^2/12),$$

where the last step follows by summing the infinite series. Finally, noting that $\pi^2/6 \le 2$ gives us the
stated probability of $1 - 6\exp(-\omega^2/12)$ with which our bounds hold. In order to complete the
proof of the theorem, we need to convert the error bound (43) from its dependence on the number
of epochs $K_T$ to the number of iterations $T$. This requires us to obtain $K_T$ in terms of $T$, which
we do next. Letting $T(K)$ be the number of iterations needed to complete $K$ epochs, we start by
computing an upper bound $g(K)$ on $T(K)$ based on our epoch length setting (32). Then inverting
the bound allows us to deduce the lower bound $K_T \ge g^{-1}(T)$, which allows us to obtain error
bounds in terms of $T$.



K 



K 



i=l 



i=l 



1 A R 2 



SV (A 1 p(G 2 + a 2 ) + (co 2 + 241ogz> 2 ) + ^ 
(A^(G 2 +a 2 ) + (u 2 + 24 log K)a 2 ) £ 2*" 1 + ^K 



i=l 



1 



< c 



^4 (MG 2 + + + 241og K)a 2 ) 2 K + -^±K 



1 



where the last inequality sums the geometric progression. Inverting the above inequality to obtain
$g^{-1}(T)$, along with some straightforward algebra, completes the proof.
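The bookkeeping behind this conversion can be sketched in a few lines. The sketch below is a simplification under stated assumptions: each epoch length is taken to be `a / R_i^2 + b`, where `a` and `b` are hypothetical placeholders that lump together all problem-dependent constants (and ignore the slowly growing $\log i$ term), and the squared radius halves after every epoch.

```python
import math


def epoch_lengths(num_epochs, r1_sq, a, b):
    """Lengths T_1, ..., T_K when T_i = ceil(a / R_i^2 + b) and R_{i+1}^2 = R_i^2 / 2."""
    lengths, r_sq = [], r1_sq
    for _ in range(num_epochs):
        lengths.append(math.ceil(a / r_sq + b))
        r_sq /= 2.0
    return lengths


def epochs_completed(budget, r1_sq, a, b):
    """Largest K such that T_1 + ... + T_K fits within a budget of iterations."""
    used, k, r_sq = 0, 0, r1_sq
    while True:
        t_i = math.ceil(a / r_sq + b)
        if used + t_i > budget:
            return k
        used += t_i
        k += 1
        r_sq /= 2.0


lengths = epoch_lengths(4, r1_sq=1.0, a=4.0, b=3.0)
print(lengths, sum(lengths))
# Geometric sum: (a / R_1^2) * (2^K - 1) + b * K, here 4 * 15 + 3 * 4.
print(4.0 * (2 ** 4 - 1) + 3.0 * 4)
print(epochs_completed(72, 1.0, 4.0, 3.0), epochs_completed(71, 1.0, 4.0, 3.0))
```

Because the total grows like $2^K$, the number of completed epochs is roughly $\log_2 T$, and the squared radius after those epochs, $R_1^2 2^{-K}$, decays roughly like $1/T$.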



4.3.2 Case 2: Extension to arbitrary $K_T$

As we observed in the previous section, when the bound (41) holds, we can ensure that $\theta^*$ stays
feasible at each epoch, thereby allowing us to use the error bounds from Proposition 1. However,
once $T$ becomes large enough, the bound (41) will no longer hold, so that the feasibility of $\theta^*$
for subsequent epochs can no longer be guaranteed. In this section, we deal with this remaining
set of iterations. In particular, let us define the critical epoch number

K T :=argmax{fc> 1 | Rl > -^-s e 2 {9*; S, r)|, (44) 
beyond which the bound (41) no longer holds. By the definition of $K_T^*$, we are guaranteed that

R 2 K* > %e 2 (9*;S,T) { >R 2 k , t+1 ® I&./2, 

where inequality (i) follows since $K_T^*$ is the largest epoch for which the bound (41) holds, and step
(ii) follows from our setting $R_{i+1}^2 = R_i^2/2$.

Now our earlier argument applies to all epochs $k \le K_T^*$, and it guarantees that after $K_T^*$ epochs,
we have

\\VK}-0*\\l<c (^ + e\e*;S,T)l) < ce 2 (9*; S,r)l (45a) 



y K * T -e*\\\<c (Ri* T + TLe 2 {e*-s,T)\ < c ^. (45b) 



with probability at least $1 - 3\big(\sum_{i=1}^{K_T^*} \frac{1}{i^2}\big) \exp(-\omega^2/12)$.

Our approach for the remaining epochs $k > K_T^*$ is to show that even though $\theta^*$ may no longer
be feasible, the error of the algorithm cannot get significantly worse than that at epoch $K_T^*$. In
order to do so, we need an additional lemma.

Lemma 3. Suppose that Assumptions 1, 2' and 3 are satisfied with parameters $G_i$, $(\gamma, \tau)$ and $\sigma_i$
respectively at epochs $i = 0, 1, 2, \ldots$. Assume that at some epoch $k$, the prox center satisfies the
bound $\|y_k - \theta^*\|_2 \le c_1 R_k/\sqrt{s}$, and that for all epochs $j > k$, the epoch lengths satisfy the bounds



J 



Then for all epochs $j > k$, we have the error bound $\|y_j - \theta^*\|_2^2 \le c_3 \frac{R_k^2}{s}$ with probability at least
$1 - 3\sum_{i=k+1}^{j} \exp(-\omega_i^2/12)$.

See Appendix C for the proof of this lemma.

Equipped with this lemma, the remainder of the proof is straightforward. Specifically,
inequality (45a) ensures that for all epochs $j > K_T^*$, we have

\\y^-9*\\ 2 2 <c(^+s^e*;S,r)A < c^, 
T \ s 7 / s 

with probability at least $1 - 3\big(\sum_{i=1}^{K_T^*} \frac{1}{i^2}\big) \exp(-\omega^2/12)$. Here step (i) follows from our definition of
$K_T^*$. Now we apply Lemma 3 to conclude that if $K_T > K_T^*$, then

\\y K * T -e*\\l<c( B ^\ <cle 2 (9*-S,r) 



1 



with probability at least $1 - 3\big(\sum_{i=1}^{K_T} \frac{1}{i^2}\big) \exp(-\omega^2/12)$. Finally, observing that the overall probability
of our bounds failing is at most $6\exp(-\omega^2/12)$ as before, we see that the statement of Theorem 4
holds in this case as well, thereby completing the proof.

4.4 Proofs of Corollaries [2] and [3] 

In this section, we will establish the corollaries of Theorem 1. We start with Corollary 2 before
moving on to Corollary 3, the latter needing our more general statement of Theorem 4.

4.4.1 Proof of Corollary 2

The corollary follows from Theorem 1 by making a particular choice of $S$ based on our assumption
$\theta^* \in \mathbb{B}_q(R_q)$. Specifically, given a parameter $\psi > 0$, define

$$|S_\psi| = R_q \psi^{-q} \quad \text{and} \quad S_\psi := \{ j \in \{1, \ldots, d\} \mid |\theta_j^*| \ge |\theta_i^*| \text{ for all } i \notin S_\psi \},$$

to be the set of $R_q \psi^{-q}$ indices corresponding to the largest coefficients of $\theta^*$ in absolute value. Given
this definition, some straightforward algebra yields that $|\theta_i^*| \le \psi$ for all $i \notin S_\psi$, which further yields
$\|\theta^*_{S_\psi^c}\|_1 \le \psi^{1-q} R_q$. (For instance, see Negahban et al. [18] for more detail on these calculations.)
With these choices, the error bound of Theorem 1 simplifies to

wot - mi < i^((G 2 + - 2 ) iogd + -V) + ^^}. 



This upper bound is minimized by setting ip* := + ° ^°^ d+u) \ substituting this choice and 
performing some algebra yields the claim of the corollary. 
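The two facts used about the chosen index set (every coefficient outside it is at most the threshold, and the $\ell_1$ mass outside it is at most the threshold raised to the power $1 - q$ times $R_q$) can be checked on a concrete vector. In this sketch the threshold parameter from the proof is called `thresh`, the set size is rounded down to an integer, and the test vector is an arbitrary illustration; all three are assumptions made for the example.

```python
import math


def top_coefficient_set(theta, q, radius, thresh):
    """Indices of the floor(radius * thresh**(-q)) largest entries of theta in absolute value."""
    k = min(len(theta), math.floor(radius * thresh ** (-q)))
    order = sorted(range(len(theta)), key=lambda j: abs(theta[j]), reverse=True)
    return set(order[:k])


theta = [1.0 / (j * j) for j in range(1, 201)]
q = 0.5
radius = sum(abs(t) ** q for t in theta)   # smallest R_q for which theta lies in the ball
thresh = 0.01

support = top_coefficient_set(theta, q, radius, thresh)
tail = [abs(t) for j, t in enumerate(theta) if j not in support]
print(len(support))
print(max(tail), thresh)                          # largest discarded coefficient vs threshold
print(sum(tail), thresh ** (1.0 - q) * radius)    # discarded l_1 mass vs its bound
```

The first fact holds because more than $R_q$ divided by the $q$-th power of the threshold coefficients above the threshold would already exceed the $\ell_q$ budget; the second follows by writing each discarded $|\theta_i|$ as $|\theta_i|^q |\theta_i|^{1-q}$.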

4.4.2 Proof of Corollary [3] 

In order to prove this result, we must first demonstrate that the RSC condition holds. For notational 
simplicity, we introduce the shorthand 

n n 

£„(0):=£}£(0;(si,I/t)) = ^log[l + exp(yi(0, Xi ))]. 

i=l i=l 

Performing a Taylor series expansion of 9 around 9 yields 



1 n 

C n {9) - C n {9) - {VC n (9),9 -9) = -J2 ^K)^, 



n 

i=l 

where ip(t) = n^^^jp is the second derivative of the logistic function, and := (a9+{l— a)9, XiUi) 
for some a G [0, 1]. 

Under the assumptions of Corollary [3J we further know that < 2BR\, and hence that 
VK a «) > ip(2BRi). Consequently, in order to establish the local RSC condition ([28]) . it suffices 
to lower bound the quantity 



n 



-h\x{e-~e)\\l 



n * — * n 

where $X \in \mathbb{R}^{n \times d}$ is the design matrix, with the vector $x_i^T$ as its $i$-th row. Quantities of this form
have been studied in random matrix theory and sparse statistical recovery. We state a specific 
result that holds under our conditions, and provide a proof in Appendix |Pl 

Lemma 4. Under the conditions of Corollary 3, there are universal constants $c, c_1$ such that we
have

$$\frac{\|Xv\|_2^2}{n} \ge \frac{\sigma_{\min}(\Sigma)}{2} \|v\|_2^2 - c\, \frac{\log d}{n} \max\Big\{ \sigma_{\min}(\Sigma), \frac{\eta_x^4}{\sigma_{\min}(\Sigma)} \Big\} \|v\|_1^2 \quad \text{for all } v \in \mathbb{R}^d$$

with probability at least $1 - 2\exp(-c_1 n \min(\sigma_{\min}^2(\Sigma)/\eta_x^4, 1))$.

Consequently, the local RSC condition (28) holds with



1 



and t 



n 



max cj min (E), 



.1 



Or, 



Substituting these values in the general statement in Theorem 4 completes the proof of the corollary.



4.5 Proof of Theorem [2] 

The main difference from the proof of Theorem Q] is that here we obtain improving bounds on the 
Lipschitz and sub-Gaussian constants at each epoch. Recalling that \\0 t — 0*\\i < 1R% at epoch 
i, a little calculation shows that Gi < 2||£|| 00 it!j, where ||£||oo is the elementwise norm of S. 
Since S is positive-semidefinite, we can further conclude that ||£||oo < p(E). Assuming further that 
IMloo < B, we see that < B. We further have the bound 

((x t ,0*> -yt)xt||« 
<2||(x tj e*- 



iy 1 1 cxd 



< %B A R 2 + 2B 2 w 2 . 



Since it^ ~ A/"(0, r/ 2 ), it is easy to check that 
E 



exp(w t /a Q ) < exp(l), where ao = er] 

2^2 



1 



< 3??, 



so that Assumption [3] is satisfied with a 2 = c [B 4 R 2 + B z r] 2 1 , for a universal constant c. Plugging 



rf = c [B*R 2 

these quantities into our earlier bound from Proposition [T] on the epoch length, we observe that 
with probability at least 1 — 3exp(— w 2 /12) the number of iterations needed at epoch i is at most 



Ti < c 



< c 



B A R 2 + V 2 B 2 ) (co 2 + logd) + 



7 log d 

7 



S2 -2 B y ("? + log rf ) + + !)(^ + log d) 

7 -^i 



7 

We can now mimic our earlier argument to obtain the total number of iterations across all epochs. 



4.6 Proof of Theorem 3

The proof relies on one further technical lemma beyond our earlier development. In order
to prove the theorem, we observe that the key argument in the convergence analysis of Section 4
was the ability to reduce the error to the optimum $\theta^*$ by a multiplicative factor after every epoch.
However, with a fixed epoch length $T_0$, it may not be possible to continue reducing the error once
the number of epochs becomes large enough. This is analogous to the difficulty we encountered in
the proof of Theorem 4, and will again be addressed using Lemma 3. We start by deducing the
epoch number $k^*$ such that we successfully halve the error at each epoch up to $k^*$. We will then
use other arguments to demonstrate that the error does not increase much for epochs $k > k^*$, and
this requires some delicate treatment of our changing objective functions. Specifically, given a fixed
epoch length $T_0 = O(\log d)$, we define

F:=supii : $/ 2+1 <^^J— , 2-^- for all epochs j < i \ . (46) 

\ s y {G 2 + a 2 )logd + co 2 a 2 * J

We start with a simple lemma, showing that Algorithm 1 run with fixed epoch lengths $T_0$ has
the desired behavior for the first $k^*$ epochs.

Lemma 5. Suppose that $T_0 = O(\log d)$ and define $k^*$ based on equation (46). Then we have

||l/fc-0*||i<12fc and \\Vk-0k\\i<Rk for alll<k<k* + 1 

with probability at least $1 - 3k\exp(-\omega^2/12)$. Under the same conditions, there is a universal
constant $c$ such that

\\yk-0*h<c-^ and \\y k - 9 k \\ 2 < c— ^ for all 2 < k < k* + 1. 

The key challenge in proving the theorem is understanding the behavior of the method after
the first $k^*$ epochs. Since the algorithm cannot guarantee that the error to $\theta^*$ will be halved for
epochs beyond $k^*$, we can no longer guarantee that $\theta^*$ will even be feasible at the later epochs.
However, this is exactly the same problem that arose in the proof of Theorem 4. Specifically, we
can use Lemma 3 in order to control the error after the first $k^*$ epochs.

In order to check the condition on epoch lengths in Lemma 3, we begin by observing that by
the definition (46) of $k^*$, we know that



' M G l+4*)+u 2 °l < Rl2 -k'/2-i = R k*+\ (47) 

Since we assume that the constants $G_k, \sigma_k$ are decreasing in $k$, the inequality also holds for all
$k > k^* + 1$, so that Lemma 3 applies in our setup here. We further observe that the setting of the
epoch lengths in Theorem 3 ensures that the total number of epochs we perform is



log 



{G 2 + a 2 ) log d + u/V 



Now we have one of two possibilities. Either $k_0 \le k^*$ or $k_0 > k^*$. In the first case, Lemma 5
ensures that the error bound $\|y_{k_0} - \theta^*\|_2^2 \le c R_{k_0}^2/s$ applies. In the second case, we appeal to Lemma 3
and obtain an error bound of $c R_{k^*}^2/s$. The proof is completed by substituting our choices of $k_0$ and
$k^*$ in these bounds.



5 Simulations 

In this section, we present the results of various numerical simulations that illustrate different
aspects of our theoretical convergence results. We focus on least-squares regression, described in
more detail in Example 2. Specifically, we generate samples $(x_t, y_t)$ with each coordinate of $x_t$
distributed as $\mathrm{Unif}[-B, B]$ and $y_t = \langle \theta^*, x_t \rangle + w_t$. We pick $\theta^*$ to be a sparse vector with $s = \lceil \log d \rceil$
non-zero coordinates, and $w_t \sim N(0, \eta^2)$ where $\eta^2 = 0.5$. Given an iterate $\theta^t$, we generate a
stochastic gradient of the expected loss (1) at $(x_t, y_t)$. For the $\ell_1$-norm, we pick the sign vector of
$\theta^t$, with $0$ for any component that is zero, a member of the $\ell_1$-sub-differential.
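The data-generating process and the stochastic subgradient described above can be sketched as follows. Several details are assumptions made only for illustration: the squared loss is scaled as $\frac{1}{2}(y - \langle\theta, x\rangle)^2$, the non-zero entries of the sparse target are set to $\pm 1$ (the text does not specify their values), and all function names are hypothetical.

```python
import math
import random


def make_theta_star(d, rng):
    """Sparse target with ceil(log d) non-zero coordinates (values +/-1 are an assumption)."""
    theta_star = [0.0] * d
    for j in rng.sample(range(d), math.ceil(math.log(d))):
        theta_star[j] = rng.choice([-1.0, 1.0])
    return theta_star


def draw_sample(theta_star, bound, noise_std, rng):
    """x has i.i.d. Unif[-bound, bound] coordinates; y = <theta_star, x> + Gaussian noise."""
    x = [rng.uniform(-bound, bound) for _ in theta_star]
    y = sum(t * xi for t, xi in zip(theta_star, x)) + rng.gauss(0.0, noise_std)
    return x, y


def sign(v):
    """Sign with sign(0) = 0, which is a valid element of the l_1 sub-differential at 0."""
    return (v > 0) - (v < 0)


def stochastic_subgradient(theta, x, y, lam):
    """Subgradient of 0.5 * (y - <theta, x>)^2 + lam * ||theta||_1 at one sample."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi + lam * sign(t) for t, xi in zip(theta, x)]


rng = random.Random(0)
theta_star = make_theta_star(100, rng)
x, y = draw_sample(theta_star, bound=1.0, noise_std=math.sqrt(0.5), rng=rng)
g = stochastic_subgradient([0.0] * 100, x, y, lam=0.1)
print(sum(1 for t in theta_star if t != 0.0), len(g))
```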

Our first set of results evaluates Algorithm 1 against other stochastic optimization baselines,
where all algorithms are given complete knowledge of problem parameters. In this first set of
simulations, we terminate epoch $i$ once $\|y_{i+1} - \theta^*\|_p \le \|y_i - \theta^*\|_p/2$, which ensures that $\theta^*$ remains
feasible throughout, and tests the performance of Algorithm 1 in the most favorable scenario. We
compare the algorithm against two baselines. The first baseline is the regularized dual averaging
(RDA) algorithm [30], applied to the regularized objective (4) with $\lambda = 4\eta\sqrt{\log d/T}$, which is
the statistically optimal regularization parameter with T samples. We use the same prox- function 
11*11 



so that the theory [30] for RDA predicts a convergence rate of 0(s y/log d/T). 



Our second baseline is the stochastic gradient (SGD) algorithm, a method that exploits the strong 
convexity but not the sparsity of the problem ([I]). Since the squared loss is not uniformly Lipschitz, 
we impose an additional constraint ||#||i < R\, without which the algorithm does not converge. The 
results of this comparison are shown in Figure [H where we present the error — averaged 
over 5 random trials. We observe that RADAR comprehensively outperforms both the baselines, 
confirming the predictions of our theory. 
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Figure 1. A comparison of RADAR with the RDA and SGD algorithms for d = 20000 (left) and 
d = 40000 (right). We plot — 0*111 averaged over 5 random trials versus the number of iterations. 



Our second set of results provides comparisons to algorithms that are tailored to exploit sparsity.
Our first baseline here is the approach that we described in our remarks following Theorem 1. In
this approach, we use the same multi-step strategy as Algorithm 1 but keep $\lambda$ fixed. We refer
to this as Epoch Dual Averaging (henceforth EDA), and again employ $\lambda = 4\eta\sqrt{(\log d)/T}$ with
this strategy. To maintain a fair comparison with the RADAR algorithm, our epochs are again
terminated by halving of the squared $\ell_p$-error measured relative to $\theta^*$. Finally, we also evaluate the
version of our algorithm with constant epoch lengths, as analyzed in Theorem 3, using epochs of
length $\log(T)$, and henceforth referred to as RADAR-CONST. As shown in Figure 2, RADAR-CONST
has relatively large error during the initial epochs, before converging quite rapidly, a
phenomenon consistent with our theory.² Even though the RADAR-CONST method does not use
the knowledge of $\theta^*$ to set epochs, all three methods exhibit the same eventual convergence rates,
with RADAR (set with optimal epoch lengths) performing the best. Although RADAR-CONST
is very slow in initial iterations, its convergence rate remains competitive with EDA (even though
EDA does exploit knowledge of $\theta^*$), but is worse than RADAR as expected.

2 To clarify, the epoch lengths in RADAR-CONST are set large enough to guarantee that we can attain an overall
error bound of $O(1/T)$, meaning that the initial epochs for RADAR-CONST are much longer than for RADAR.
Thus, after roughly 500 iterations, RADAR-CONST has done only 2 epochs and operates with a crude constraint set
$\Omega(R_1/4)$. During epoch $i$, the step size scales proportionally to $R_i/\sqrt{t}$, where $t$ is the iteration number within the
epoch; hence, when $R_i$ is large, the relatively large initial steps in an epoch can take us to a bad solution even when
we start with a good solution. As $R_i$ decreases further with more epochs, this effect is mitigated and the error of
RADAR-CONST does rapidly decrease like our theory predicts.

Overall, our experiments demonstrate that RADAR and RADAR-CONST have practical performance
consistent with our theoretical predictions. Although the optimal setting of epoch lengths is not
too critical for our approach, finding better data-dependent empirical rules for determining epoch lengths
remains an interesting question for future research. The relatively poorer performance of EDA
demonstrates the importance of our decreasing regularization schedule.

Figure 2. A comparison of RADAR with EDA and RADAR-CONST for d = 20000 (left) and
d = 40000 (right). We plot $\|\theta^t - \theta^*\|_2^2$ averaged over 5 random trials versus the number of iterations.



6 Discussion

In this paper, we presented an algorithm that is able to take advantage of the strong convexity
and sparsity conditions that are satisfied by many common problems in machine learning. Our
algorithm is simple and efficient to implement, and for a $d$-dimensional objective with an $s$-sparse
optimum, it achieves the minimax-optimal convergence rate $O((s \log d)/T)$. We also demonstrate
optimal convergence rates for problems that have weakly sparse optima, with implications for
problems such as sparse linear regression and sparse logistic regression. While we focus our attention
exclusively on sparse vector recovery due to space constraints, the ideas naturally extend to other
structures such as group sparse vectors and low-rank matrices [18]. It would be of interest to study
similar developments for other algorithms such as mirror descent or Nesterov's accelerated gradient
methods, leading to multi-step variants of those methods with optimal convergence rates in our
setting.

A Closed-form updates

In this appendix, we derive a closed form expression for the update (5b) when $\Omega = \mathbb{R}^d$. Recalling
our definition of the prox-function, the constraint $\theta^{t+1} \in \Omega(R_i)$ can be rewritten as
$\psi_{y_i,R_i}(\theta) \leq \frac{1}{2(p-1)}$. We now form the Lagrangian at iteration $t+1$,

$$ \alpha^{t+1}\langle \mu^{t+1}, \theta \rangle + \psi_{y_i,R_i}(\theta) + \xi\Big(\psi_{y_i,R_i}(\theta) - \frac{1}{2(p-1)}\Big), $$

where $\xi \geq 0$ is the Lagrangian parameter. The first-order optimality condition for the Lagrangian
allows us to conclude that $\alpha^{t+1}\mu^{t+1} + \nabla\psi_{y_i,R_i}(\theta^{t+1})(1 + \xi) = 0$, so that the iterate at time $t+1$ is
given by

$$ \theta^{t+1} = \nabla\psi^*_{y_i,R_i}\Big(-\frac{\alpha^{t+1}\mu^{t+1}}{1+\xi}\Big), $$

where $\psi^*_{y_i,R_i}$ denotes the Fenchel conjugate [13]. Recalling the form of our prox-function, we
have

$$ \psi_{y_i,R_i}(\theta) = \frac{1}{2(p-1)R_i^2}\|\theta - y_i\|_p^2 \quad \text{and} \quad \psi^*_{y_i,R_i}(\mu) = \langle y_i, \mu \rangle + \frac{(p-1)R_i^2}{2}\|\mu\|_q^2, $$

where $q = p/(p-1)$ is the conjugate exponent to $p$. This is a straightforward consequence of the
Fenchel duality of $\ell_p$ norms (e.g., see Example 6.0.2 in [13]).

We can now take the gradient of the dual function to obtain the following closed form expression

$$ \theta^{t+1} = y_i - R_i^2\,\frac{\alpha^{t+1}(p-1)}{1+\xi}\,|\mu^{t+1}|^{(q-1)}\,\mathrm{sign}(\mu^{t+1})\,\|\mu^{t+1}\|_q^{(2-q)}. $$

The value of $\xi$ can now be obtained by backsubstitution in the constraint $\theta \in \Omega(R_i)$. Doing so and
performing some algebra yields

$$ \xi := \max\big\{0,\ (p-1)\,\alpha^{t+1}\|\mu^{t+1}\|_q R_i - 1\big\}, $$

where $|\mu^{t+1}|^{(q-1)}$ refers to taking the absolute values and exponents elementwise and $q = p/(p-1)$
is the conjugate exponent to $p$.
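As an illustration only, the closed-form update can be sketched in a few lines. The sketch assumes the prox-function $\frac{1}{2(p-1)R_i^2}\|\theta - y_i\|_p^2$ and $\Omega = \mathbb{R}^d$ as in this appendix, and it takes the update direction to be the negative of the scaled dual vector, as the first-order condition requires. The names `lq_norm` and `closed_form_update` are hypothetical. The product $|\mu_j|^{q-1}\|\mu\|_q^{2-q}$ is evaluated as $(|\mu_j|/\|\mu\|_q)^{q-1}\|\mu\|_q$ to avoid overflow when $q$ is large.

```python
import math

def lq_norm(v, q):
    """l_q norm, computed after rescaling by the largest entry for numerical stability."""
    m = max(abs(x) for x in v)
    if m == 0.0:
        return 0.0
    return m * sum((abs(x) / m) ** q for x in v) ** (1.0 / q)

def closed_form_update(mu, alpha, y, R, p):
    """Minimize alpha*<mu, theta> + psi(theta) subject to ||theta - y||_p <= R."""
    q = p / (p - 1.0)                       # conjugate exponent
    mu_q = lq_norm(mu, q)
    if mu_q == 0.0:
        return list(y)                      # zero dual vector: stay at the prox-center
    # Lagrange multiplier: zero when the unconstrained minimizer is already feasible
    xi = max(0.0, (p - 1.0) * alpha * mu_q * R - 1.0)
    scale = R * R * (p - 1.0) * alpha / (1.0 + xi)
    theta = []
    for m, yj in zip(mu, y):
        sgn = (m > 0) - (m < 0)
        theta.append(yj - scale * sgn * (abs(m) / mu_q) ** (q - 1.0) * mu_q)
    return theta
```

When the multiplier is active, the returned point lies on the boundary $\|\theta - y_i\|_p = R_i$; otherwise its distance from $y_i$ equals $R_i^2 (p-1)\,\alpha\,\|\mu\|_q$.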



B Proofs for convergence within a single epoch 

In this appendix, we prove various results on convergence behavior within a single epoch, including
Lemmas 1 and 2, as well as Proposition 1.

B.1 Proof of Lemma 1

By the optimality of $\hat\theta_i$, we have

$$ \mathcal{L}(\hat\theta_i) + \lambda_i\|\hat\theta_i\|_1 \leq \mathcal{L}(\theta^*) + \lambda_i\|\theta^*\|_1. \tag{48} $$



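From the local RSC assumption (28), for any vector $\theta$ feasible during epoch $i$, we have the lower
bound

$$ \begin{aligned} \mathcal{L}(\theta) &\geq \mathcal{L}(\theta^*) + \langle \nabla\mathcal{L}(\theta^*), \theta - \theta^* \rangle + \frac{\gamma}{2}\|\theta - \theta^*\|_2^2 - \tau\|\theta - \theta^*\|_1^2 \\ &\overset{(i)}{\geq} \mathcal{L}(\theta^*) + \frac{\gamma}{2}\|\theta - \theta^*\|_2^2 - \tau\|\theta - \theta^*\|_1^2, \end{aligned} \tag{49} $$

where step (i) follows since $\theta^*$ minimizes $\mathcal{L}(\theta)$. Applying inequality (49) with $\theta = \hat\theta_i$ and combining
with the initial bound (48) yields

$$ \mathcal{L}(\hat\theta_i) + \lambda_i\|\hat\theta_i\|_1 \leq \mathcal{L}(\hat\theta_i) - \frac{\gamma}{2}\|\hat\theta_i - \theta^*\|_2^2 + \tau\|\hat\theta_i - \theta^*\|_1^2 + \lambda_i\|\theta^*\|_1. $$

Using the definition $\hat\theta_i = \theta^* + \Delta_i$ and triangle inequality, we can further simplify to obtain

$$ 0 \leq \lambda_i\|\theta^*\|_1 - \frac{\gamma}{2}\|\theta^* - \hat\theta_i\|_2^2 + \tau\|\hat\theta_i - \theta^*\|_1^2 - \lambda_i\|\hat\theta_i\|_1 \leq \lambda_i\|\Delta_i\|_1 - \frac{\gamma}{2}\|\Delta_i\|_2^2 + \tau\|\Delta_i\|_1^2. $$

Rearranging the terms above yields

$$ \frac{\gamma}{2}\|\Delta_i\|_2^2 \leq \lambda_i\|\Delta_i\|_1 + \tau\|\Delta_i\|_1^2 \leq 2\sqrt{s}\,\lambda_i\|\Delta_i\|_2 + 2\lambda_i\|\theta^*_{S^c}\|_1 + 8s\tau\|\Delta_i\|_2^2 + 8\tau\|\theta^*_{S^c}\|_1^2. $$

Some elementary algebra then yields that $e := \|\Delta_i\|_2$ satisfies the quadratic inequality

$$ \frac{1}{2}\{\gamma - 16s\tau\}\,e^2 - \{2\sqrt{s}\,\lambda_i\}\,e - \{2\lambda_i\|\theta^*_{S^c}\|_1 + 8\tau\|\theta^*_{S^c}\|_1^2\} \leq 0, $$

which then implies the $\ell_2$-error bound (38a).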
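To make the last step concrete: a quadratic inequality $a e^2 - b e - c \leq 0$ with $a > 0$ and $b, c \geq 0$ forces $e$ to lie below the larger root $(b + \sqrt{b^2 + 4ac})/(2a)$. The sketch below evaluates this root with the coefficients read off the inequality above, namely $a = (\gamma - 16 s\tau)/2$, $b = 2\sqrt{s}\lambda_i$ and $c = 2\lambda_i\|\theta^*_{S^c}\|_1 + 8\tau\|\theta^*_{S^c}\|_1^2$. The function names are hypothetical, and the constants obtained this way need not match those stated in (38a).

```python
import math

def largest_root(a, b, c):
    """Largest e >= 0 with a*e^2 - b*e - c <= 0, for a > 0 and b, c >= 0."""
    return (b + math.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)

def squared_error_bound(gamma, tau, s, lam, tail_l1):
    """Upper bound on ||Delta_i||_2^2 implied by the quadratic inequality.

    tail_l1 stands for ||theta*_{S^c}||_1; it is zero for an exactly s-sparse optimum.
    """
    a = 0.5 * (gamma - 16.0 * s * tau)
    if a <= 0:
        raise ValueError("requires gamma > 16 * s * tau")
    b = 2.0 * math.sqrt(s) * lam
    c = 2.0 * lam * tail_l1 + 8.0 * tau * tail_l1 ** 2
    return largest_root(a, b, c) ** 2
```

In the exactly sparse case with $\tau = 0$, the root reduces to $b/a$, so the bound becomes $16 s \lambda_i^2/\gamma^2$, which shows the familiar $s\lambda_i^2$ scaling.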

In order to establish the $\ell_1$-error bound (38b), we require an auxiliary lemma that allows us to
translate between the $\ell_2$ and $\ell_1$-norms:

Lemma 6. For any pair of vectors $\theta, \tilde\theta \in \Omega$, suppose that $\|\tilde\theta\|_1 \leq \|\theta\|_1 + \epsilon$ for some $\epsilon \geq 0$. Then
for any set $A \subseteq \{1, 2, \ldots, d\}$, the vector $\Delta := \tilde\theta - \theta$ satisfies the inequality

$$ \|\Delta_{A^c}\|_1 \leq \|\Delta_A\|_1 + 2\|\theta_{A^c}\|_1 + \epsilon. \tag{50} $$

Proof. Since $A$ and $A^c$ are disjoint, the bound assumed in the lemma statement can be written

$$ \|\tilde\theta_A\|_1 + \|\tilde\theta_{A^c}\|_1 \leq \|\theta_A\|_1 + \|\theta_{A^c}\|_1 + \epsilon. \tag{51} $$

Since $\tilde\theta = \Delta + \theta$ by definition, triangle inequality implies that

$$ \|\tilde\theta_A\|_1 \geq \|\theta_A\|_1 - \|\Delta_A\|_1 \quad \text{and} \quad \|\tilde\theta_{A^c}\|_1 \geq \|\Delta_{A^c}\|_1 - \|\theta_{A^c}\|_1. $$

Substituting into the bound (51), we obtain

$$ \|\Delta_{A^c}\|_1 - \|\theta_{A^c}\|_1 + \|\theta_A\|_1 - \|\Delta_A\|_1 \leq \|\theta_A\|_1 + \|\theta_{A^c}\|_1 + \epsilon, $$

and rearranging terms completes the proof. □

In particular, applying Lemma 6 to the pair $\tilde\theta = \hat\theta_i$ and $\theta = \theta^*$ with the tolerance $\epsilon = 0$ and
the subset $A = S$, we find that the error vector $\Delta_i := \hat\theta_i - \theta^*$ satisfies the bound

$$ \|(\Delta_i)_{S^c}\|_1 \leq \|(\Delta_i)_S\|_1 + 2\|\theta^*_{S^c}\|_1. \tag{52} $$

Consequently, we have

$$ \|\Delta_i\|_1 = \|(\Delta_i)_S\|_1 + \|(\Delta_i)_{S^c}\|_1 \leq 2\|(\Delta_i)_S\|_1 + 2\|\theta^*_{S^c}\|_1 \leq 2\sqrt{s}\,\|(\Delta_i)_S\|_2 + 2\|\theta^*_{S^c}\|_1, \tag{53} $$

where the final step uses the fact that $(\Delta_i)_S$ is an $s$-vector.
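The cone-type inequality of Lemma 6 is easy to check numerically. The sketch below is only a sanity check under stated assumptions: it draws arbitrary vectors, sets the tolerance to the smallest non-negative value for which the lemma's hypothesis holds, and returns the slack between the two sides of (50). The helper names are hypothetical.

```python
def l1(v, idx):
    """l1 norm of v restricted to the index collection idx."""
    return sum(abs(v[j]) for j in idx)

def lemma6_gap(theta, theta_tilde, A):
    """Right-hand side minus left-hand side of (50).

    The tolerance eps is max(0, ||theta_tilde||_1 - ||theta||_1), so the
    hypothesis ||theta_tilde||_1 <= ||theta||_1 + eps holds by construction.
    """
    d = len(theta)
    A = set(A)
    Ac = [j for j in range(d) if j not in A]
    delta = [a - b for a, b in zip(theta_tilde, theta)]
    eps = max(0.0, l1(theta_tilde, range(d)) - l1(theta, range(d)))
    lhs = l1(delta, Ac)
    rhs = l1(delta, A) + 2.0 * l1(theta, Ac) + eps
    return rhs - lhs
```

A non-negative return value (up to rounding) for every choice of vectors and index set is what the lemma predicts.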

B.2 Proof of Proposition 1, Inequality (36a)

We are now equipped to prove Proposition 1, beginning with the first bound (36a). Introducing
the convenient shorthand $e^t = g^t - \nabla\mathcal{L}(\theta^t)$, our assumptions guarantee that there are constants $G_i$
and $\sigma_i$ such that $\mathbb{E}\exp\big(\|e^t\|_\infty^2/\sigma_i^2\big) \leq \exp(1)$, and



\C{9) - C{9)\ < Gi\\9 - 9\\i for all 9,9 satisfying ||0-jfc||i < Ri- (54) 

Our starting point is a known result for the convergence of the stochastic dual averaging algorithm.
Recalling the definition $\bar\theta(T) = \sum_{t=1}^T \theta^t/T$, and letting $\hat g^t = g^t + \lambda_i\nu^t$, with $\nu^t \in \partial\|\theta^t\|_1$, be the stochastic subgradient
at iteration $t$, we have



where $\gamma_\psi$ is the strong convexity coefficient of the prox-function with respect to the $\ell_1$-norm, equal
to $1/(eR_i^2)$ in our case. This bound follows directly from the analysis of Nesterov [22] and Xiao [30];
the specific form (55) given here corresponds to Lemma 2 of Duchi et al. [8].

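Now observe that since $\hat g^t = \nabla\mathcal{L}(\theta^t) + \lambda_i\nu^t + e^t$, triangle inequality yields the upper bound

$$ \|\hat g^t\|_\infty^2 \leq 2\big(\|\nabla\mathcal{L}(\theta^t) + \lambda_i\nu^t\|_\infty^2 + \|e^t\|_\infty^2\big) \overset{(i)}{\leq} 4G_i^2 + 4\lambda_i^2 + 2\|e^t\|_\infty^2, $$

where inequality (i) uses the Lipschitz condition in Assumption 1 and the Lipschitz property of the
$\ell_1$-norm. From this point, further simplifying the error bound (55) requires controlling the random
terms

$$ \sum_{t=1}^T \alpha^{t-1}\|e^t\|_\infty^2 \quad \text{and} \quad \sum_{t=1}^T \langle e^t, \hat\theta_i - \theta^t \rangle. \tag{56} $$

Accordingly, we state an auxiliary lemma that provides tail bounds appropriate for this purpose.

Lemma 7. Under the sub-Gaussian tail condition (Assumption 3):
(a) With step sizes $\alpha^t = \alpha/\sqrt{t}$, we have

$$ \sum_{t=1}^T \alpha^{t-1}\|e^t\|_\infty^2 \leq 2\sigma_i^2\alpha\sqrt{T} + \sigma_i^2\,\omega\,\alpha\sqrt{2\log T} \tag{57} $$

with probability at least $1 - 2\exp(-\omega^2/12)$ for all $\omega \leq 9\sqrt{\log T}$.
(b) We have $\sum_{t=1}^T \langle e^t, \hat\theta_i - \theta^t \rangle \leq \omega R_i\sigma_i\sqrt{T}$ with probability at least $1 - 2\exp(-\omega^2/12)$.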

See Appendix B.5 for the proof of this result. We now use it to control the terms in equation (56).
Starting with the first term, we observe from Lemma 7(a) that, for $T$ large enough that $9\log T \leq \sqrt{T}$, we are guaranteed that

$$ \begin{aligned} \sum_{t=1}^T \alpha^{t-1}\|\hat g^t\|_\infty^2 &\leq \sum_{t=1}^T \alpha^{t-1}\big(4G_i^2 + 4\lambda_i^2 + 2\|e^t\|_\infty^2\big) \\ &\leq 4(G_i^2 + \lambda_i^2)\sum_{t=1}^T \alpha^{t-1} + 2\sigma_i^2\alpha\big(2\sqrt{T} + \omega_i\sqrt{2\log T}\big) \\ &\leq 8(G_i^2 + \lambda_i^2)\,\alpha\sqrt{T} + 22\,\sigma_i^2\,\alpha\sqrt{T}, \end{aligned} $$

with probability at least $1 - 2\exp(-\omega_i^2/12)$. Here the last step uses the inequality $9\log T \leq \sqrt{T}$,
as well as the assumption $\omega_i \leq 9\sqrt{\log T}$. Thus, we have established an upper
bound on the gradient terms with an effective Lipschitz constant

$$ \sum_{t=1}^T \alpha^{t-1}\|\hat g^t\|_\infty^2 \leq 22\,\alpha\sqrt{T}\,(G_i^2 + \lambda_i^2 + \sigma_i^2). \tag{58} $$

Part (b) of Lemma 7 directly controls the second random quantity.

We now plug in the results of these lemmas into our earlier error bound (|55|) . which yields, with 
probability at least 1 — 3 exp(— uf /12), the upper bound 

fMT)) - fS) < -^=(Gf + Af + a 2 ) + ^= + ^ 



2 7v ,VT aVT VT 



<10jG 2 + a 2 + xd^ + UmRi 



Here the second inequality uses the setting a = 5y A^^/(G 2 + Af + erf). We also note that under 

our assumption that ip is 1-strongly convex with respect to || • ||i, we have that 7^ = l/(eR 2 ) at 
epoch i. Thus with probability at least 1 — 3exp(— wf /12), we have the error bound 



fmn - m) < sorai MGi t ai + Xi) + UiatRt 



T JT 



< 30R,\l MG l +ai) +30R i xJ^ + ^Si, (59) 



T V T y/T 

thus completing the first bound in Proposition 1.

B.3 Proof of Lemma 2

The main idea of this proof is to first convert the error bound (59) from function values into $\ell_1$ and
$\ell_2$-norm bounds by exploiting the (approximate) sparsity of $\theta^*$. We will then use these bounds to
simplify the RSC condition. Since the error bound (59) holds for the minimizer $\hat\theta_i$, it also holds for any
other feasible vector. In particular, applying it to $\theta^*$, we obtain the bound



fmrn - sm < c ^J M<%+<t) + 3QRi M± + (60) 

m v * V * *v « 

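Our next step is to lower bound the left-hand side of this inequality. We have

$$ \begin{aligned} f_i(\bar\theta(T_i)) - f_i(\theta^*) &= \mathcal{L}(\bar\theta(T_i)) + \lambda_i\|\bar\theta(T_i)\|_1 - \mathcal{L}(\theta^*) - \lambda_i\|\theta^*\|_1 \\ &\overset{(i)}{\geq} \mathcal{L}(\theta^*) + \lambda_i\|\bar\theta(T_i)\|_1 - \mathcal{L}(\theta^*) - \lambda_i\|\theta^*\|_1 \\ &= \lambda_i\big\{\|\bar\theta(T_i)\|_1 - \|\theta^*\|_1\big\}, \end{aligned} $$

where inequality (i) follows since $\theta^*$ minimizes $\mathcal{L}$. Combining with the bound (60) yields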



l*P»ll. < Vh + ^i. M±^> + 30*, /£ + . (6i) 

At this point we recall the shorthand notations $\Delta^*(T_i) = \bar\theta(T_i) - \theta^*$ and $\hat\Delta(T_i) = \bar\theta(T_i) - \hat\theta_i$. In
order to bring the above bound on $\|\bar\theta(T_i)\|_1$ closer to the statement of the lemma, we can appeal
to Lemma 6. Indeed an application of the lemma in conjunction with the inequality (61) results in
the bound



|A*(T 4 )Hli < HA*m)slli + ^J MGl + <rl) + IA± + upR* + (g2) 



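Our next step is to convert the above cone bound on $\Delta^*(T_i)$ into a similar result for $\hat\Delta(T_i)$. In
order to do so, we observe that $\Delta^*(T_i) - \hat\Delta(T_i) = \hat\theta_i - \theta^*$, and hence

$$ \begin{aligned} \|\hat\theta_i - \theta^*\|_1 &= \|\Delta^*_S(T_i) - \hat\Delta_S(T_i)\|_1 + \|\Delta^*_{S^c}(T_i) - \hat\Delta_{S^c}(T_i)\|_1 \\ &\geq \big\{\|\Delta^*_S(T_i)\|_1 - \|\hat\Delta_S(T_i)\|_1\big\} - \big\{\|\Delta^*_{S^c}(T_i)\|_1 - \|\hat\Delta_{S^c}(T_i)\|_1\big\}, \end{aligned} $$

and hence

$$ \|\hat\Delta_{S^c}(T_i)\|_1 - \|\hat\Delta_S(T_i)\|_1 \leq \|\Delta^*_{S^c}(T_i)\|_1 - \|\Delta^*_S(T_i)\|_1 + \|\hat\theta_i - \theta^*\|_1. $$

Consequently, Lemma 1 provides the final piece to complete the proof. Combining inequality (39)
obtained from Lemma 1 with our earlier bound (62) yields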



llSpi),!, < ||Sp S ) s || I+ «^ +r ,|| 1 ( 6 + 8 yf )+^MM +3 „^| + ^. 

Consequently, a further use of the inequality $\|\hat\Delta(T_i)_S\|_1 \leq \sqrt{s}\,\|\hat\Delta(T_i)\|_2$ allows us to conclude that
there is a universal constant $c$ such that



^ + 11^11? (l + f )) (63) 



| A||? < 8*|| A||i + c ( + _^ (^ (G? + a 2) + ^ + 
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with probability at least 1 — 3 exp (-w?/12). Substituting the settings (JSJ) and (J32J) of A; and Tj 
respectively into the above bound yields 



,2 7 



(64) 



where we recall the notation e 2 (6*; S,r) = s s 1 (l + S). 

In order to complete the proof, we now invoke the RSC assumption applied to the function $f_i$.
Specifically, since $\hat\theta_i$ minimizes $f_i$, the RSC condition implies that



7, 



Combining the above inequality with the bound (|64p yields 



1 



l|A(roill </i(0(TO)-/i(0i) + T 



A(ri)||l < /i(e(T0) - + r||A(T0||?. 
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7 



Rearranging terms and recalling the notation $\bar\gamma = \gamma - 16s\tau$ completes the proof.

B.4 Proof of Proposition 1, Inequality (36b)

Equipped with Lemma 2, we are now ready to prove the second part of Proposition 1. In particular,
using the inequality (63) in the proof of Lemma 2, we observe that with probability at least
$1 - 3\exp(-\omega_i^2/12)$, we have



s 2 \ 2 



R 



R 2 A^ 



A||? < 8*|| Ag + c ( + ^ (^(G? + a 2 ) + ^ 2 a 2 ) + ^ + seV ; S, r 



1 



< ^(/,(0(T,)) - fS) + r\\A\\j) + c + ^ (^(G 2 + vD+ufaf) + t^L + s£ 2(0* 



s 2 A 2 _/g 



Here the second inequality uses the local RSC condition (28), and the fact that $\hat\theta_i$ minimizes $f_i$.

From here onwards, all our inequalities hold with probability at least $1 - 3\exp(-\omega_i^2/12)$, so that
we no longer state it explicitly. Rearranging terms and recalling the definition (30) of $\bar\gamma$, we obtain
that



7, 
7 



16s 



s 2 \ 2 R 



R 2 A t 



A||? < — {mm - fM)) + c (-gi + ^7 (^(G 2 + a 2 ) + + + S e 2 (r ; 5, r) 



Combining the above bound with our earlier inequality (j59[) yields ^||A|| 2 < $1 + < I>2> where 
16S.R, 

$1 := 



4\/A 4 ,(G 2 + cj 2 ) + 4X iV r A^ + ujidij , and 
2 R 2 A. V 



D 2\2 



^2 := c (^A + A_ (^(G 2 + a 2 ) + ^) + + se 2 (P- S, r)) 



(65a) 
(65b) 



By the Cauchy-Schwarz inequality, we have



^iAi . /S < 2^? + 2^, and hence 



$1 < 



16si?i 

WW 



4^^ + a/) + ^ ) + 128 ( + 



s 2 \ 2 RfA^p 



T 



(66) 
Noting that $\Phi_1$ and $\Phi_2$ involve multiple terms, some increasing and others decreasing in $\lambda_i$, we
optimize the choice of $\lambda_i$, in particular by setting



Rq 



M<% + <>?) + 



(67) 



Using this setting and combining the upper bound ^||A||f < $1 + $2 with the form (I65bj) of <I>2 



and the upper bound (|66j) on $1, we find that 

sRi 



|A||?<c 



RH 



Combining the above inequality with the error bound (|39p for 9i and triangle inequality leads to 
||A*|| 2 < 2||A|| 2 +2\\0* - 9i\\\ < 2||A||if + 



12 , 162s2X * +cs£ \e*;S,T) 



7 



7 (s 2 \ 2 



<2||A||J + ^c{ ^+ se 2 (9*;S,T 
7 V 7 

where the second inequality follows since 7 > 7. Substituting the setting (|67p of Aj yields an upper 
bound identical to our earlier bound (|68p with different constants. 

Finally, in order to use 9{Ti) as our next prox-center y^+i, we would also like to control the error 
H^(Tj) — Since Aj+i < Aj by assumption, we obtain the same form of error bound (|6"5j) . We 

want to run the epoch till all these error terms drop to R 2 + i '■= R 2 /2- Recalling our assumption 
that ~/£ 2 (9*; S,t)/j < R 2 /^, it suffices to set the epoch length Ti to ensure that 



sRg 



sR a . R 2 , iR 2 A it 

uj;(jj < — , and c — 

12 7^ 



7^V--- ■ -w - 12 
All the above conditions are met if we choose the epoch length 

s 2 7~ , ^ 



< 



12 



Ti = C 



^R 2 ' ' ' " 7 



(^(G 2 + a 2 )+u, 2 a 2 ) + ^- 



for a suitably large universal constant $C$, which completes the proof of the second part of the
proposition. The stated bound in function values follows from substituting the choice of $\lambda_i$ in our
earlier bound (59) and some straightforward algebra.



B.5 Proof of Lemma 7

It remains to prove Lemma 7, a result used during the proof of Proposition 1. We do so by exploiting
some classical martingale tail bounds of the Azuma-Hoeffding type. The particular result given
here is due to Lan et al. [16]:

Lemma 8. Let $z_1, z_2, \ldots$ be a sequence of i.i.d. random variables, let $\sigma_t > 0$, $t = 1, 2, \ldots$ be a
sequence of deterministic numbers, and let $\phi_t = \phi_t(z^t)$ be deterministic (measurable) functions of
$z^t = (z_1, z_2, \ldots, z_t)$. Using $\mathcal{F}_t$ to denote the $\sigma$-field of $z^t$, we have:
(a) Suppose $\mathbb{E}[\phi_t \mid \mathcal{F}_{t-1}] = 0$ with probability one and $\mathbb{E}[\exp(\phi_t^2/\sigma_t^2) \mid \mathcal{F}_{t-1}] \leq \exp(1)$ with
probability one for all $t$. Then

$$ \mathbb{P}\Big[\sum_{t=1}^T \phi_t > \delta\Big(\sum_{t=1}^T \sigma_t^2\Big)^{1/2}\Big] \leq \exp(-\delta^2/3) \quad \text{for all } \delta \geq 0. \tag{69} $$

(b) Suppose that $\mathbb{E}[\exp(|\phi_t|/\sigma_t) \mid \mathcal{F}_{t-1}] \leq \exp(1)$ w.p. 1 for all $t$. Letting $\sigma^T = (\sigma_1, \ldots, \sigma_T)$, we
have the bound

$$ \begin{aligned} \mathbb{P}\Big[\sum_{t=1}^T \phi_t > \|\sigma^T\|_1 + \delta\|\sigma^T\|_2\Big] &\leq \exp(-\delta^2/12) + \exp\Big(-\frac{3\|\sigma^T\|_2}{4\|\sigma^T\|_\infty}\,\delta\Big) \\ &\leq \exp(-\delta^2/12) + \exp(-3\delta/4). \end{aligned} \tag{70} $$

We now use this lemma to prove parts (a) and (b) of Lemma 7.

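B.5.1 Proof of part (a)

We start by showing that the conditions of Lemma 8 are satisfied. Indeed, by Assumption 3 we
have

$$ \mathbb{E}\exp\Big(\frac{\alpha^{t-1}\|e^t\|_\infty^2}{\sigma_i^2\alpha^{t-1}}\Big) = \mathbb{E}\exp\Big(\frac{\|e^t\|_\infty^2}{\sigma_i^2}\Big) \leq \exp(1). $$

Consequently, in order to satisfy the condition of Lemma 8(b), it suffices to set $\sigma_t = \sigma_i^2\alpha^{t-1}$.
Recalling our choice $\alpha^t = \alpha/\sqrt{t}$, we find that

$$ \|\sigma^T\|_1 = \sigma_i^2\alpha\sum_{t=1}^T \frac{1}{\sqrt{t}} \leq 2\sigma_i^2\alpha\sqrt{T}, \qquad \|\sigma^T\|_2 = \sigma_i^2\alpha\sqrt{\sum_{t=1}^T \frac{1}{t}} \leq \sigma_i^2\alpha\sqrt{2\log T}, \qquad \text{and} \qquad \|\sigma^T\|_\infty = \sigma_i^2\alpha. $$

Plugging the above quantities in the statement of Lemma 8(b) with $\delta = \omega_i$ yields

$$ \begin{aligned} \mathbb{P}\Big[\sum_{t=1}^T \alpha^{t-1}\|e^t\|_\infty^2 > 2\sigma_i^2\alpha\sqrt{T} + \omega_i\sigma_i^2\alpha\sqrt{2\log T}\Big] &\leq \exp(-\omega_i^2/12) + \exp\Big(-\frac{3\|\sigma^T\|_2\,\omega_i}{4\sigma_i^2\alpha}\Big) \\ &\leq \exp(-\omega_i^2/12) + \exp\Big(-\frac{3\sqrt{\log T}\,\omega_i}{4}\Big) \\ &\leq 2\exp(-\omega_i^2/12), \end{aligned} $$

where the last inequality uses our assumption $\omega_i \leq 9\sqrt{\log T}$, thus completing the proof.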
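The elementary facts used in this part, namely $\sum_{t \leq T} 1/\sqrt{t} \leq 2\sqrt{T}$, $\sum_{t \leq T} 1/t \leq 2\log T$, and the comparison $\exp(-3\sqrt{\log T}\,\omega/4) \leq \exp(-\omega^2/12)$ when $\omega \leq 9\sqrt{\log T}$, can be checked numerically. The sketch below is only a sanity check: it assumes natural logarithms and the sequence $\sigma_t = \sigma_i^2\alpha/\sqrt{t}$ for $t = 1, \ldots, T$ (ignoring the index shift in $\alpha^{t-1}$), and the function names are hypothetical.

```python
import math

def sigma_norms(alpha, sigma_sq, T):
    """l1, l2 and l-infinity norms of sigma_t = sigma_sq * alpha / sqrt(t), t = 1..T."""
    seq = [sigma_sq * alpha / math.sqrt(t) for t in range(1, T + 1)]
    l1 = sum(seq)
    l2 = math.sqrt(sum(v * v for v in seq))
    return l1, l2, max(seq)

def bounds_hold(alpha, sigma_sq, T):
    """Check the three norm bounds used in the proof for a given horizon T."""
    l1, l2, linf = sigma_norms(alpha, sigma_sq, T)
    return (
        l1 <= 2.0 * sigma_sq * alpha * math.sqrt(T)
        and l2 <= sigma_sq * alpha * math.sqrt(2.0 * math.log(T))
        and math.isclose(linf, sigma_sq * alpha)
    )

def second_tail_dominated(omega, T):
    """True when exp(-3*sqrt(log T)*omega/4) <= exp(-omega^2/12)."""
    return math.exp(-0.75 * math.sqrt(math.log(T)) * omega) <= math.exp(-omega ** 2 / 12.0)
```

Note that the harmonic-sum bound $\sum_{t \leq T} 1/t \leq 2\log T$ needs $T \geq 3$; for $T = 2$ the sum is $1.5$ while $2\log 2 \approx 1.39$.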
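B.5.2 Proof of part (b)

We now turn to part (b) of Lemma 7. By assumption, we have $\mathbb{E}[e^t \mid \mathcal{F}_{t-1}] = 0$; moreover, the
random variable $\theta^t$ is measurable with respect to $\mathcal{F}_{t-1}$ and $\hat\theta_i$ is deterministic. Consequently, the
first condition of Lemma 8(a) is satisfied. For the second condition, we observe that by Hölder's
inequality and Assumption 3 we have

$$ \mathbb{E}\exp\Big(\frac{\langle e^t, \hat\theta_i - \theta^t \rangle^2}{4\sigma_i^2R_i^2}\Big) \leq \mathbb{E}\exp\Big(\frac{\|e^t\|_\infty^2\,\|\hat\theta_i - \theta^t\|_1^2}{4\sigma_i^2R_i^2}\Big) \overset{(a)}{\leq} \mathbb{E}\exp\Big(\frac{4R_i^2\|e^t\|_\infty^2}{4\sigma_i^2R_i^2}\Big) = \mathbb{E}\exp\Big(\frac{\|e^t\|_\infty^2}{\sigma_i^2}\Big) \leq \exp(1). $$

Here inequality (a) uses the facts that $\|\theta^t - y_i\|_1 \leq R_i$ and $\|\hat\theta_i - y_i\|_1 \leq R_i$ by the definition of
our updates (5), so that the conditions of Lemma 8(a) are satisfied with $\sigma_t = 2\sigma_iR_i$. Plugging this
setting in the result of the lemma and setting $\delta = \omega/2$ completes the proof.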

C Proof of Lemma 3

The proof of this lemma is based on a pair of auxiliary results, which we begin by stating.

Lemma 9. Suppose at some epoch $k$, we have the bound $\|y_k - \theta^*\|_1 \leq R_k$. Then for all epochs
$j > k$, we have

$$ \|y_j - \theta^*\|_1 \leq 8R_k. $$

Lemma 10. Under the conditions of Lemma 3, for any epoch $i > k$, we have

/i(y<+0 - fM < c^2-(- fc V 2 . (71) 

See Sections C.2 and C.3 for the proofs of these two lemmas, respectively.

C.1 Main argument

With these auxiliary results in hand, we now turn to the proof of Lemma 3. By the definition of
$f_i$, we have

£(y i+1 ) - £(9*) = (£y i+1 - £( Vi )) + (£( yi ) - C(6*)) 

= (fi(y i+1 ) - fM) + AfefeHi - \\y i+ i\\i) + {C{ yi ) - £(9*)) 

§ c thJL 2 -(i-fc)/2 + \ lRi + (C( yi ) - 1(9*)), 

where step (i) follows from a combination of Lemma 10 and triangle inequality along with the
feasibility of $y_{i+1}$ at epoch $i$, since $\|y_i\|_1 - \|y_{i+1}\|_1 \leq \|y_i - y_{i+1}\|_1 \leq R_i$. By applying the Cauchy-Schwarz
inequality to the second term in the bound above, we obtain

C(y i+1 ) - £(9*) < c X- + ^ + -± + (£( Vi ) - £{9*)) 

S7 2s 27 

< C 1 + (£(y fc) _ £(#*)), 
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where the last inequality uses the setting (|T5|) of Aj and the setting (|32|) of Tj. Recursing the 
argument further yields 

£( yj )-£(6*)<c -^— s + (C(y k+ i) - C(9*)) 

i=k+l S ^ 

<c - + (£(y k+1 )-£(6*)), 

(V2 - l)s7 

where the second inequality upper bounds the sum of the geometric progression. Recalling the 
given conditions in the lemma, we obtain the upper bound 

£(y k+l ) - l{6*) < fk(Vk+l) - fk(0*) + 2X k R k < c - 7 



Combining the two previous bounds yields £{yj) — £{9*) < c — 



S7 

Our last step is to apply the RSC condition to this inequality. Since $\theta^*$ minimizes $\mathcal{L}(\theta)$, we have



\hi - 0*111 < C{v,) - £(9*) + r\\ yj - e*\\l < c 



R 2 kl 2 , ^2 



+ TRu 



where the second inequality uses Lemma [9l Finally, we recall that 7 = 7— 16sr, which allows 

us to further simplify the above upper bound to %\\yj — ^ C ~F"' an d observing that 7 < 7 
completes the proof. 



C.2 Proof of Lemma 9

The proof of this lemma is straightforward given the definition of our updates (5). At any epoch
$j > k$, the prox center $y_j$ is feasible at epoch $j - 1$, so that

$$ \|y_j - y_{j-1}\|_1 \leq e\,\|y_j - y_{j-1}\|_p \leq eR_{j-1}, $$

where we have used the fact that $\|\theta\|_1 \leq e\|\theta\|_p$, by our choice of $p$. Consequently, by definition
of the updates (5), we have

$$ \begin{aligned} \|y_j - \theta^*\|_1 &\leq \|y_j - y_{j-1}\|_1 + \|y_{j-1} - \theta^*\|_1 \\ &\leq eR_{j-1} + \|y_{j-1} - y_{j-2}\|_1 + \|y_{j-2} - \theta^*\|_1. \end{aligned} $$

By repeating this argument, we may unwind the error bound until we reach epoch $k$, thereby
obtaining the bound

$$ \|y_j - \theta^*\|_1 \leq \sum_{i=k+1}^{j} eR_i + \|y_k - \theta^*\|_1 \overset{(i)}{\leq} \sum_{i=k+1}^{j} eR_i + R_k, $$

where inequality (i) follows by the lemma assumption, which controls the last term from epoch $k$.
As a result, we can further obtain

$$ \|y_j - \theta^*\|_1 \leq eR_{k+1}\sum_{j=0}^{\infty} 2^{-j/2} + R_k. $$

Summing the geometric progression and noting that $e/(\sqrt{2} - 1) \leq 7$ completes the proof.
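The constant in the final step can be checked directly. The sketch below assumes the halving schedule $R_{i+1}^2 = R_i^2/2$, so that $R_{k+1} = R_k/\sqrt{2}$, and evaluates the multiple of $R_k$ that the bound produces; the function name and the truncation length are hypothetical choices.

```python
import math

def lemma9_constant(num_terms=200):
    """Evaluate (e / sqrt(2)) * sum_{j >= 0} 2^(-j/2) + 1, the multiple of R_k in the bound.

    The infinite geometric series is truncated after num_terms terms; the
    neglected tail is of order 2^(-num_terms/2).
    """
    geometric_sum = sum(2.0 ** (-j / 2.0) for j in range(num_terms))
    return math.e * geometric_sum / math.sqrt(2.0) + 1.0
```

The series sums to $\sqrt{2}/(\sqrt{2} - 1)$, so the constant equals $e/(\sqrt{2} - 1) + 1$, which lies between 7 and 8 and is therefore consistent with the factor 8 in Lemma 9.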
C.3 Proof of Lemma 10

Note that at any epoch $i$, the prox-center $y_i$ is always feasible by construction. As a result,
equation (59) guarantees that



fiiVi+i) ~ fi{Vi) < cRi 



(0 

< cRi 



A^(Gf + of) + ujiGi 



T 3 ; 



(72) 



with probability at least $1 - 3\exp(-\omega_i^2/12)$, where step (i) uses the elementary inequality $2ab \leq a^2 + b^2$.
Recalling our setting (34) of the regularization parameter $\lambda_i$, we find that



i*i7 "V T * v 7 ^ 

Substituting this upper bound in our earlier inequality ([72]) yields 



hiVi+i) ~ fiiVi) < cRi 



A^(G 2 + a 2 ) + uai R . lA ^ 



+ 



sTi 



Under the conditions of Lemma 02 the first term in the above inequality is at most ^ 2 R^/{2s^) for 
any i > k. Further recalling the assumption that Tj = 0(A 1 / J ), we see that for any i > k 



fiiyi+x) - fi( yi ) < cRi i 2 R k /{2s 1 ) + 



«7 



—2 



c 2L(#2 2 -(fc-l)/2 2 -(i-l)/2 + R 2 2 -(i-l)^ 

S7 

p2o-(fc-l)=2 

7_^ 2 -(i-fc)/2 +2 -(i-fc)j 



S7 



< c fc7 2- ( ^ fc)/2 , 
S7 



which completes the proof. 



D Proof of Lemma 4

Results of this flavor have been established in prior work (e.g., [25, 17]). We provide a proof here
for completeness, building on the result of Loh and Wainwright [17]. Following their notation, we
define the $\ell_0$-"ball" of radius $k$, namely the set $\mathbb{B}_0(k) := \{\theta \mid \theta_j \neq 0 \text{ for at most } k \text{ indices}\}$, as well
as the set

$$ \mathbb{K}(k) = \mathbb{B}_0(k) \cap \mathbb{B}_2(1). $$
We establish our claim by appealing to Lemma 12 of Loh and Wainwright, applying their result
with the settings

X T X n foiL-fE) 

T = S and the sparsity parameter k := cq mm < , 

n log a I Vx 



where Co is an appropriate universal constant chosen to ensure that k > 1. Based on this result, 
we see that it suffices to establish 



v I E | 7; 

n 



< (7min(S) for all v G K(2k). 
~ 54 



Under Assumption H] on the sub-Gaussianity of the design matrix X, we can establish the 
above condition by appealing to Lemma 15 of Loh and Wainwright [T7]. Specifically, we apply 
their result with t = °" m ' n ( s ) anc l with s = k as defined above. Then we can mimic the ar- 
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gument in the proof of Lemma 1 in the paper [17] to conclude that with probability at least 
1 — 2exp(— cinmin(<T^ lin (S)/r/^, 1)) we have the bound 



T ,X T X 
v E ] v 



< ( \\vf 2 + l ^P\ for all v G R d . 

n 2 V k 

Substituting our setting of k and rearranging terms completes the proof. 

E Proof of Lemma 

The base case for the $\ell_1$-error bound at $k = 1$ is true by our assumption that $\|\theta^*\|_1 \leq R_1$. As a
result, the convergence analysis of Proposition 1 applies at the first epoch. Assuming that $k^* > 1$,
our setting of $k^*$ ensures that



R lS jA^Gl + aD + uoi R 2 
— 7= < lc— 1 = cR\. 



7 vn 1 

Since T = 0(A 1 p) by assumption, the R\A^/Tq term can also be further upper bounded by cR\. 
Hence, as long as R\ > 2e 2 (8*; S,t), we obtain the stated l\ error bound at the second epoch by 
applying equation (|36bj) from Proposition [TJ A similar calculation using equation I36al yields 

IWKn) - Hi < f2(e(T )) - f 2 (9 2 ) < c ^L. 

I s 

Finally, we can obtain a similar bound up to constant factors on $\|\bar\theta(T_0) - \theta^*\|_2^2$ as well by combining
with the $\ell_2$-error bound of Lemma 1 as before. Thus, we obtain our inductive claim for $k = 2$.
Assuming the inductive hypothesis for arbitrary $i < k^* + 1$, the reasoning for obtaining the inductive
claim at $i + 1$ is exactly identical to the above arguments, completing the proof of the lemma.